Abstract



We consider the question of the stability of evolutionary algorithms to gradual changes,
or drift, in the target concept. We define an algorithm to be resistant to drift if, for
some inverse polynomial drift rate in the target function, it converges to accuracy $1 - \epsilon$
with polynomial resources, and then stays within that accuracy indefinitely, except with
probability $\epsilon$ at any one time. We show that every evolution algorithm, in the sense of
Valiant [19], can be converted, using the Correlational Query technique of Feldman [9], into
such a drift resistant algorithm. For certain evolutionary algorithms, such as for Boolean
conjunctions, we give bounds on the rates of drift that they can resist. We develop some
new evolution algorithms that are resistant to significant drift. In particular, we give an
algorithm for evolving linear separators over the spherically symmetric distribution that is
resistant to a drift rate of $O(\epsilon/n)$, and another algorithm over the more general product
normal distributions that resists a smaller drift rate.

The above translation result can also be interpreted as one on the robustness of the notion of
evolvability itself under changes of definition. As a second result in that direction we show
that every evolution algorithm can be converted to a quasi-monotonic one that can evolve
from any starting point without the performance ever dipping significantly below that of
the starting point. This permits the somewhat unnatural feature of arbitrary performance
degradations to be removed from several known robustness translations.

1 Overview

The evolvability model introduced by Valiant [19] was designed to provide a quantitative theory
for studying mechanisms that can evolve in populations of realistic size, in a reasonable number of
generations, through the Darwinian process of variation and selection. It models evolving mechanisms
as functions of many arguments, where the value of a function represents the outcome of the
mechanism, and the arguments the controlling factors. For example, the function might determine
the expression level of a particular protein given the expression levels of related proteins. Evolution
is then modeled as a restricted form of learning from examples, in which the learner observes only
the empirical performance of a set of functions that are feasible variants of the current function.
The performance of a function is defined as its correlation with the ideal function, which specifies
for every possible circumstance the behavior that is most beneficial in the current environment for
the evolving entity.

The evolution process consists of repeated applications of a random variation step followed by a
selection step. In the variation step of round $i$, a polynomial number of variants of the algorithm's
current hypothesis $r_i$ are generated, and their performance is empirically tested. In the selection step,
one of the variants with high performance is chosen as $r_{i+1}$. An algorithm therefore consists of
both a procedure for describing possible variants and a selection mechanism for choosing
among the variants. The algorithm succeeds if it produces a hypothesis with performance close to
that of the ideal function using only a polynomial amount of resources (in terms of number of generations
and population size).
The basic model as defined in Valiant [19] is concerned with the evolution of Boolean functions
using representations that are randomized Boolean functions. This has been shown by Feldman
[10] to be a highly robust class under variations in definition, as is necessary for any computational
model that aims to capture the capabilities and limitations of a natural phenomenon. This model
has also been extended to allow for representations with real number values, in which case a range
of models arise that differ according to whether the quadratic loss or some other metric is used in
evaluating performance [13, [l3l. Our interest here remains with the original Boolean model, which
is invariant under changes of this metric.

In this paper we consider the issue of stability of an evolution algorithm to gradual changes, or 
drift, in the target or ideal function. Such stability is a desirable property of evolution algorithms 
that is not explicitly captured in the original definition. We present two main results in this paper. 
First, for specific evolution algorithms we quantify how resistant they are to drift. Second, we show 
that evolutionary algorithms can be transformed to stable ones, showing that the evolutionary model 
is robust also under modifications that require resistance to drift. 

The issue of resistance to drift has been discussed informally before in the context of evolution
algorithms that are monotone in the sense that their performance is increasing, or at least
non-decreasing, at every stage [13, [l3l. We shall therefore start by distinguishing among three notions
of monotonicity in terms of properties that need to hold with high probability: (i) quasi-monotonic,
where for any $\epsilon$ the performance never goes more than $\epsilon$ below that of the starting hypothesis $r_0$,
(ii) monotonic, where the performance never goes below that of $r_0$, and (iii) strictly monotonic,
where performance increases by at least an inverse polynomial amount at each step. Definition (ii)
is essentially Feldman's [10] and definition (iii) is implicit in Michael [17].

We define a notion of an evolution algorithm being stable to drift in the sense that for some
inverse polynomial amount of drift, using only polynomial resources, the algorithm will converge to
performance $1 - \epsilon$, and will stay with such high performance in perpetuity in the sense that at every
subsequent time, except with probability $\epsilon$, its performance will be at least $1 - \epsilon$.

As our main result demonstrating the robustness of the evolutionary model itself, we show,
through the simulation of query learning algorithms [9], that for every distribution $\mathcal{D}$, every function
class that is evolvable in the original definition is also evolvable by an algorithm that is both (i)
quasi-monotonic, and (ii) stable to some inverse polynomial amount of drift. While the definitions
allow any small enough inverse polynomial drift rate, they require good performance in perpetuity,
and with the same representation class for all $\epsilon$. Some technical complications arise as a result of
the latter two requirements.

As a vehicle for studying the stability of specific algorithms, we show that there are natural 
evolutionary algorithms for linear separators over symmetric distributions and over the more general 
product normal distributions. Further we formulate a general result that states that for any strictly 
monotonic evolution algorithm, where the increase in performance at every step is defined by an 
inverse polynomial b, one can determine upper bounds on the polynomial parameters of the evolution 
algorithm, namely those that bound the generation numbers, population sizes, and sample sizes, 
and also a lower bound on the drift that can be resisted. We illustrate the usefulness of this 
formulation by applying it to show that our algorithms for linear separators can resist a significant 
amount of drift. We also apply it to existing algorithms for evolving conjunctions over the uniform 
distribution, with or without negations. We note that the advantages of evolution algorithms that 
use natural representations, over those obtained through simulations of query learning algorithms, 
may be quantified in terms of how moderate the degrees are of the polynomials that bound the 
generation number, population size, sample size and (inverse) drift rate of these algorithms. These
results appear in Sections 6 and 7 and may be read independently of Section 5.

All omitted details and proofs appear in the appendix. 

2 The Computational Model of Evolution 

In this section, we provide an overview of the original computational model of evolution (Valiant
[19], where further details can be found). Many of these notions will be familiar to readers who are
acquainted with the PAC model of learning [l^.

2.1 Basic Definitions 

Let $\mathcal{X}$ be a space of examples. A concept class $\mathcal{C}$ over $\mathcal{X}$ is a set of functions mapping elements in
$\mathcal{X}$ to $\{-1, 1\}$. A representation class $\mathcal{R}$ over $\mathcal{X}$ consists of a set of (possibly randomized) functions
from $\mathcal{X}$ to $\{-1, 1\}$ described in a particular language. Throughout this paper, we think of $\mathcal{C}$ as
the class of functions from which the ideal target $f$ is selected, and $\mathcal{R}$ as a class of representations
from which the evolutionary algorithm chooses an $r$ to approximate $f$. We consider only classes of
representations that can be evaluated efficiently, that is, classes $\mathcal{R}$ such that for any $r \in \mathcal{R}$ and any
$x \in \mathcal{X}$, $r(x)$ can be evaluated in time polynomial in the size of $x$.

We associate a complexity parameter $n$ with $\mathcal{X}$, $\mathcal{C}$, and $\mathcal{R}$. This parameter indicates the number of
dimensions of each element in the domain. For example, we might define $\mathcal{X}_n$ to be $\{-1, 1\}^n$, $\mathcal{C}_n$ to be
the class of monotone conjunctions over $n$ variables, and $\mathcal{R}_n$ to be the class of monotone conjunctions
over $n$ variables with each conjunction represented as a list of variables. Then $\mathcal{C} = \{\mathcal{C}_n\}_{n=1}^{\infty}$ and
$\mathcal{R} = \{\mathcal{R}_n\}_{n=1}^{\infty}$ are really ensembles of classes.¹ Many of our results depend on this complexity
parameter $n$. However, we drop the subscripts when the meaning is clear from context.

The performance of a representation $r$ with respect to the ideal target $f$ is measured with respect
to a distribution $\mathcal{D}$ over examples. This distribution represents the relative frequency with which
the organism faces each set of conditions in $\mathcal{X}$. Formally, for any pair of functions $f : \mathcal{X} \to \{-1, 1\}$,
$r : \mathcal{X} \to \{-1, 1\}$, and distribution $\mathcal{D}$ over $\mathcal{X}$, we define the performance of $r$ with respect to $f$ as

$$\mathrm{Perf}_f(r, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[f(x) r(x)] = 1 - 2\,\mathrm{err}_{\mathcal{D}}(f, r),$$

where $\mathrm{err}_{\mathcal{D}}(f, r) = \Pr_{x \sim \mathcal{D}}(f(x) \neq r(x))$ is the 0/1 error between $f$ and $r$. The performance thus
measures the correlation between $f$ and $r$ and is always between $-1$ and $1$.
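The identity between performance and 0/1 error can be checked numerically on a small domain. The following is a minimal sketch, assuming the uniform distribution over $\{-1, 1\}^n$ and two monotone conjunctions chosen purely for illustration; the helper names `conjunction`, `performance`, and `error` are hypothetical and not part of the model.

```python
from itertools import product

def conjunction(variables):
    """Monotone conjunction of the listed variables, with values in {-1, +1}."""
    return lambda x: 1 if all(x[i] == 1 for i in variables) else -1

def performance(f, r, n):
    """Exact Perf_f(r, D) when D is uniform over {-1, 1}^n (an assumption)."""
    points = list(product([-1, 1], repeat=n))
    return sum(f(x) * r(x) for x in points) / len(points)

def error(f, r, n):
    """Exact 0/1 error err_D(f, r) when D is uniform over {-1, 1}^n."""
    points = list(product([-1, 1], repeat=n))
    return sum(f(x) != r(x) for x in points) / len(points)

n = 4
f = conjunction([0, 1])       # hypothetical ideal target: x0 AND x1
r = conjunction([0, 1, 2])    # hypothetical representation with one extra variable
perf, err = performance(f, r, n), error(f, r, n)
print(perf, err)              # perf equals 1 - 2 * err
```

Enumerating the domain is only feasible for very small $n$; it is used here so that the two quantities are exact rather than estimated.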

A new mutation is selected after each round of variation based in part on the observed fitness of
the variants, i.e., their empirical correlations with the target on a polynomial number of examples.
Formally, the empirical performance of $r$ with respect to $f$ on a set of examples $x_1, \ldots, x_s$ chosen
independently according to $\mathcal{D}$ is a random variable defined as $(1/s) \sum_{i=1}^{s} f(x_i) r(x_i)$.

We denote by $\epsilon$ an accuracy parameter specifying how close to the ideal target a representation
must be to be considered good. A representation $r$ is a good approximation of $f$ if $\mathrm{Perf}_f(r, \mathcal{D}) > 1 - \epsilon$
(or equivalently, if $\mathrm{err}_{\mathcal{D}}(f, r) < \epsilon/2$). We allow the evolution algorithm to use resources that are
polynomial in both $1/\epsilon$ and the dimension $n$.

2.2 Model of Variation and Selection 

An evolutionary algorithm $\mathcal{E}$ determines at each round $i$ which set of mutations of the algorithm's
current hypothesis $r_{i-1}$ should be evaluated as candidates for $r_i$, and how the selection will be made.
The algorithm $\mathcal{E} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$ is specified by the following set of components:

• The representation class $\mathcal{R} = \{\mathcal{R}_n\}_{n=1}^{\infty}$ specifies the space of representations over $\mathcal{X}$ from which
the algorithm may choose functions $r$ to approximate the target $f$.

• The (possibly randomized) function $\mathrm{Neigh}(r, \epsilon)$ specifies for each $r \in \mathcal{R}_n$ the set of representations
$r' \in \mathcal{R}_n$ into which $r$ can randomly mutate. This set of representations is referred to as
the neighborhood of $r$. For all $r$ and $\epsilon$, it is required that $r \in \mathrm{Neigh}(r, \epsilon)$ and that the size of
the neighborhood is upper bounded by a polynomial.

• The function $\mu(r, r', \epsilon)$ specifies for each $r \in \mathcal{R}_n$ and each $r' \in \mathrm{Neigh}(r, \epsilon)$ the probability that $r$
mutates into $r'$. It is required that for all $r$ and $\epsilon$, for all $r' \in \mathrm{Neigh}(r, \epsilon)$, $\mu(r, r', \epsilon) > 1/p(n, 1/\epsilon)$
for a polynomial $p$.

• The function $t(r, \epsilon)$, referred to as the tolerance of $\mathcal{E}$, determines the difference in performance
that a mutation in the neighborhood of $r$ must exhibit in order to be considered a "beneficial",
"neutral", or "deleterious" mutation. The tolerance is required to be bounded from above and
below, for all representations $r$, by a pair of inverse polynomials in $n$ and $1/\epsilon$.

• Finally, the function $s(r, \epsilon)$, referred to as the sample size, determines the number of examples
used to evaluate the empirical performance of each $r' \in \mathrm{Neigh}(r, \epsilon)$. The sample size must also
be polynomial in $n$ and $1/\epsilon$.

The functions $\mathrm{Neigh}$, $\mu$, $t$, and $s$ must all be computable in time polynomial in $n$ and $1/\epsilon$.

We are now ready to describe a single round of the evolution process. For any ideal target
$f \in \mathcal{C}$, distribution $\mathcal{D}$, evolutionary algorithm $\mathcal{E} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$, accuracy parameter $\epsilon$, and
representation $r_{i-1}$, the mutator $M(f, \mathcal{D}, \mathcal{E}, \epsilon, r_{i-1})$ returns a random mutation $r_i \in \mathrm{Neigh}(r_{i-1}, \epsilon)$
using the following selection procedure. First, for each $r \in \mathrm{Neigh}(r_{i-1}, \epsilon)$, the mutator computes
the empirical performance of $r$ with respect to $f$ on a sample of size $s$.² Call this $v(r)$. Let

$$\mathrm{Bene} = \{ r \mid r \in \mathrm{Neigh}(r_{i-1}, \epsilon),\ v(r) \geq v(r_{i-1}) + t(r_{i-1}, \epsilon) \}$$

be the set of "beneficial" mutations and

$$\mathrm{Neut} = \{ r \mid r \in \mathrm{Neigh}(r_{i-1}, \epsilon),\ |v(r) - v(r_{i-1})| < t(r_{i-1}, \epsilon) \}$$

be the set of "neutral" mutations. If at least one beneficial mutation exists, then a mutation $r$ is
chosen from Bene as the survivor $r_i$ with relative probability $\mu(r_{i-1}, r, \epsilon)$. If no beneficial mutation
exists, then a mutation $r$ is chosen from Neut as the survivor $r_i$, again with probability proportional
to $\mu(r_{i-1}, r, \epsilon)$. Notice that, by definition, $r_{i-1}$ is always a member of Neut, and hence a neutral
mutation is guaranteed to exist.

¹ As in the PAC model, $n$ should additionally upper bound the size of representation of the function to
be learned, but for brevity we shall omit this aspect here.

² We assume a single sample is used to evaluate the performance of all neighbors and $r_{i-1}$, but one could
interpret the model as using independent samples for each representation. This would not change our results.
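The selection procedure above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names, the use of a single shared sample, and the convention that Bene uses a non-strict inequality are assumptions, and `weights[k]` stands in for $\mu(r_{i-1}, r, \epsilon)$ of the k-th neighbor.

```python
import random

def empirical_performance(f, r, sample):
    """Average of f(x) * r(x) over the sample, i.e. the empirical correlation."""
    return sum(f(x) * r(x) for x in sample) / len(sample)

def mutate(f, current, neighbors, weights, tolerance, sample, rng=random):
    """One selection step of the mutator.

    `neighbors` must contain `current`, mirroring the requirement that
    r is in Neigh(r, eps); `tolerance` must be positive.
    """
    v_cur = empirical_performance(f, current, sample)
    scores = [empirical_performance(f, r, sample) for r in neighbors]
    # Beneficial: empirical performance exceeds the current one by the tolerance.
    bene = [k for k, v in enumerate(scores) if v >= v_cur + tolerance]
    # Neutral: empirical performance within the tolerance of the current one.
    neut = [k for k, v in enumerate(scores) if abs(v - v_cur) < tolerance]
    pool = bene if bene else neut
    # Choose a survivor with probability proportional to its weight.
    chosen = rng.choices(pool, weights=[weights[k] for k in pool])[0]
    return neighbors[chosen]
```

Because the current representation always has a score difference of zero, the neutral pool is never empty when the tolerance is positive, which is the guarantee noted in the text.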

2.3 Putting It All Together 

A concept class $\mathcal{C}$ is said to be evolvable by algorithm $\mathcal{E}$ over distribution $\mathcal{D}$ if for every target $f \in \mathcal{C}$,
starting at any $r_0 \in \mathcal{R}$, the sequence of mutations defined by $\mathcal{E}$ converges in polynomial time to a
representation $r$ whose performance with respect to $f$ is close to 1. This is formalized as follows.



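Definition 1 (Evolvability [19]) For a concept class $\mathcal{C}$, distribution $\mathcal{D}$, and evolutionary algorithm
$\mathcal{E} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$, we say that $\mathcal{C}$ is evolvable over $\mathcal{D}$ by $\mathcal{E}$ if there exists a polynomial
$g(n, 1/\epsilon)$ such that for every $n \in \mathbb{N}$, $f \in \mathcal{C}_n$, $r_0 \in \mathcal{R}_n$, and $\epsilon > 0$, with probability at
least $1 - \epsilon$, a sequence $r_0, r_1, r_2, \ldots$ generated by setting $r_i = M(f, \mathcal{D}, \mathcal{E}, \epsilon, r_{i-1})$ for all $i$ satisfies
$\mathrm{Perf}_f(r_{g(n, 1/\epsilon)}, \mathcal{D}) > 1 - \epsilon$.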

We say that the class $\mathcal{C}$ is evolvable over $\mathcal{D}$ if there exists a valid evolution algorithm $\mathcal{E} =
(\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$ such that $\mathcal{C}$ is evolvable over $\mathcal{D}$ by $\mathcal{E}$. The polynomial $g(n, 1/\epsilon)$, referred to as the
generation polynomial, is an upper bound on the number of generations required for the evolution
process to converge. If the above definition holds only for a particular value (or set of values) for
$r_0$, then we say that $\mathcal{C}$ is evolvable with initialization.

2.4 Alternative Models 

Various alternative formulations of the basic computational model of evolution described here have 
been studied. Many have been proved equivalent to the basic model in the sense that any concept 
class C evolvable in the basic model is evolvable in the alternative model and vice versa. Here we 
briefly discuss some of the variations that have been considered. 

The performance measure $\mathrm{Perf}_f(r, \mathcal{D})$ is defined in terms of the 0/1 loss. Alternative performance
measures based on squared loss or other loss functions have been studied in the context of
evolution [ifl Eli US. However, these alternative measures are identical to the original when $f$ and
$r$ are (possibly randomized) binary functions, as we have assumed. (When the model is extended
to allow real-valued function output, evolvability with a performance measure based on any
non-linear loss function is strictly more powerful than evolvability with the standard correlation-based
performance measure \W^. We do not consider that extension in this work.)

Alternative rules for determining how a mutation is selected have also been considered. In particular,
Feldman [l^l showed that evolvability using a selection rule that always chooses among the
mutations with the highest or near highest empirical performance in the neighborhood is equivalent 
to evolvability with the original selection rule based on the classes Bene and Neut. He also discussed 
the performance of "smooth" selection rules, in which the probability of a given mutation surviving 
is a smooth function of its original frequency and the performance of mutations in the neighborhood. 

Finally, Feldman P, |l^ showed that fixed-tolerance evolvability, in which the tolerance $t$ is a
function of only $n$ and $1/\epsilon$ but not the representation $r_{i-1}$, is equivalent to the basic model.

3 Notions of Monotonicity 



Feldman [10, 11] introduced the notion of monotonic evolution in the computational model described
above. His notion of monotonicity, restated here in Definition 2, requires that with high probability,
the performance of the current representation $r_i$ never drops below the performance of the initial
representation $r_0$ during the evolution process.

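Definition 2 (Monotonic Evolution) An evolution algorithm $\mathcal{E}$ monotonically evolves a class $\mathcal{C}$
over a distribution $\mathcal{D}$ if $\mathcal{E}$ evolves $\mathcal{C}$ over $\mathcal{D}$ and with probability at least $1 - \epsilon$, for all $i < g(n, 1/\epsilon)$,
$\mathrm{Perf}_f(r_i, \mathcal{D}) \geq \mathrm{Perf}_f(r_0, \mathcal{D})$, where $g(n, 1/\epsilon)$ and $r_0, r_1, \ldots$ are defined as in Definition 1.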

When explicit initialization of the starting representation $r_0$ is prohibited, this is equivalent to
requiring that $\mathrm{Perf}_f(r_i, \mathcal{D}) \geq \mathrm{Perf}_f(r_{i-1}, \mathcal{D})$ for all $i < g(n, 1/\epsilon)$. In other words, it is equivalent
to requiring that with high probability, performance never decreases during the evolution process.



(Feldman showed that if representations may produce real-valued output and an alternate performance
measure based on squared loss is considered, then any class $\mathcal{C}$ that is efficiently SQ learnable
over a known, efficiently samplable distribution $\mathcal{D}$ is monotonically evolvable over $\mathcal{D}$.)

A stronger notion of monotonicity was used by Michael [17], who, in the context of real-valued
representations and quadratic loss functions, developed an evolution algorithm for learning 1-decision
lists in which only beneficial mutations are allowed. In this spirit, we define the notion of strict
monotonic evolution, which requires a significant (inverse polynomial) performance increase at every
round of evolution until a representation with sufficiently high performance is found.

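Definition 3 (Strict Monotonic Evolution) An evolution algorithm $\mathcal{E}$ strictly monotonically
evolves a class $\mathcal{C}$ over a distribution $\mathcal{D}$ if $\mathcal{E}$ evolves $\mathcal{C}$ over $\mathcal{D}$ and, for a polynomial $m$, with
probability at least $1 - \epsilon$, for all $i < g(n, 1/\epsilon)$, either $\mathrm{Perf}_f(r_{i-1}, \mathcal{D}) > 1 - \epsilon$ or $\mathrm{Perf}_f(r_i, \mathcal{D}) >
\mathrm{Perf}_f(r_{i-1}, \mathcal{D}) + 1/m(n, 1/\epsilon)$, where $g(n, 1/\epsilon)$ and $r_0, r_1, \ldots$ are defined as in Definition 1.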

Below we show that a class $\mathcal{C}$ is strictly monotonically evolvable over a distribution $\mathcal{D}$ using
representation class $\mathcal{R}$ if and only if it is possible to define a neighborhood function satisfying the
property that for any $r \in \mathcal{R}$ and $f \in \mathcal{C}$, if $\mathrm{Perf}_f(r, \mathcal{D})$ is not already near optimal, there exists a
neighbor $r'$ of $r$ such that $r'$ has a noticeable (again, inverse polynomial) performance improvement
over $r$. We call such a neighborhood function strictly beneficial. The idea of strictly beneficial
neighborhood functions plays an important role in developing our results in Sections 5 and 7.
Feldman [11] uses a similar notion to show monotonic evolution under square loss.

Definition 4 (Strictly Beneficial Neighborhood Function) For a concept class $\mathcal{C}$, distribution
$\mathcal{D}$, and representation class $\mathcal{R}$, we say that a (possibly randomized) function $\mathrm{Neigh}$ is a
strictly beneficial neighborhood function if the size of $\mathrm{Neigh}(r, \epsilon)$ is upper bounded by a polynomial
$p(n, 1/\epsilon)$, and there exists a polynomial $b(n, 1/\epsilon)$ such that for every $n \in \mathbb{N}$, $f \in \mathcal{C}_n$,
$r \in \mathcal{R}_n$, and $\epsilon > 0$, if $\mathrm{Perf}_f(r, \mathcal{D}) < 1 - \epsilon/2$, then there exists an $r' \in \mathrm{Neigh}(r, \epsilon)$ such that
$\mathrm{Perf}_f(r', \mathcal{D}) > \mathrm{Perf}_f(r, \mathcal{D}) + 1/b(n, 1/\epsilon)$. We refer to $b(n, 1/\epsilon)$ as the benefit polynomial.

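Lemma 5 For any concept class $\mathcal{C}$, distribution $\mathcal{D}$, and representation class $\mathcal{R}$, if $\mathrm{Neigh}$ is a strictly
beneficial neighborhood function for $\mathcal{C}$, $\mathcal{D}$, and $\mathcal{R}$, then there exist valid functions $\mu$, $t$, and $s$ such
that $\mathcal{C}$ is strictly monotonically evolvable over $\mathcal{D}$ by $\mathcal{E} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$. If a concept class $\mathcal{C}$ is
strictly monotonically evolvable over $\mathcal{D}$ by $\mathcal{E} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$, then $\mathrm{Neigh}$ is a strictly beneficial
neighborhood function for $\mathcal{C}$, $\mathcal{D}$, and $\mathcal{R}$.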

The proof of the second half of the lemma is immediate; the definition of strictly monotonic
evolvability requires that for any initial representation $r_0 \in \mathcal{R}$, with high probability either
$\mathrm{Perf}_f(r_0, \mathcal{D}) > 1 - \epsilon/2$ or $\mathrm{Perf}_f(r_1, \mathcal{D}) > \mathrm{Perf}_f(r_0, \mathcal{D}) + 1/m(n, 2/\epsilon)$ for a polynomial $m$.
Thus if $\mathrm{Perf}_f(r_0, \mathcal{D}) < 1 - \epsilon/2$ there must exist an $r_1$ in the neighborhood of $r_0$ such that
$\mathrm{Perf}_f(r_1, \mathcal{D}) > \mathrm{Perf}_f(r_0, \mathcal{D}) + 1/m(n, 2/\epsilon)$. The key idea behind the proof of the first half is
to show that it is possible to set the tolerance $t(r, \epsilon)$ in such a way that with high probability, Bene
is never empty and there is never a representation in Bene with performance too much worse than
that of the beneficial mutation guaranteed by the definition of the strictly beneficial neighborhood
function. This implies that the mutation algorithm is guaranteed to choose a new representation
with a significant increase in performance at each round.

Finally, we define quasi-monotonic evolution. This is similar to monotonic evolution, except
that the performance is allowed to go slightly below that of $r_0$. In Section 5.7 we show that
this notion can be made universal, in the sense that every evolvable class is also evolvable
quasi-monotonically.

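Definition 6 (Quasi-Monotonic Evolution) An evolution algorithm $\mathcal{E}$ quasi-monotonically
evolves a class $\mathcal{C}$ over $\mathcal{D}$ if $\mathcal{E}$ evolves $\mathcal{C}$ over $\mathcal{D}$ and with probability at least $1 - \epsilon$, for all $i < g(n, 1/\epsilon)$,
$\mathrm{Perf}_f(r_i, \mathcal{D}) \geq \mathrm{Perf}_f(r_0, \mathcal{D}) - \epsilon$, where $g(n, 1/\epsilon)$ and $r_0, r_1, \ldots$ are defined as in Definition 1.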
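To make the three notions of monotonicity concrete, the sketch below checks each condition on a single recorded trajectory of true performances $\mathrm{Perf}_f(r_0, \mathcal{D}), \mathrm{Perf}_f(r_1, \mathcal{D}), \ldots$. It is only illustrative: the definitions are probabilistic statements over runs, whereas these hypothetical helper functions test one run, and `step` stands in for the value $1/m(n, 1/\epsilon)$.

```python
def is_monotonic(perfs):
    """Definition 2 on one run: performance never drops below that of r_0."""
    return all(p >= perfs[0] for p in perfs)

def is_quasi_monotonic(perfs, eps):
    """Definition 6 on one run: performance never drops more than eps below r_0."""
    return all(p >= perfs[0] - eps for p in perfs)

def is_strictly_monotonic(perfs, eps, step):
    """Definition 3 on one run: each round either starts above 1 - eps
    or improves on the previous round by more than `step`."""
    return all(prev > 1 - eps or cur > prev + step
               for prev, cur in zip(perfs, perfs[1:]))

# Hypothetical trajectory that dips slightly before improving.
run = [0.2, 0.15, 0.4, 0.9]
print(is_monotonic(run), is_quasi_monotonic(run, 0.1))
```

A trajectory such as the one above fails the monotonic test because of the early dip, yet passes the quasi-monotonic test once a slack of $\epsilon$ is allowed.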

4 Resistance to Drift 

There are many ways one could choose to formalize the notion of drift resistance. Our formalization
is closely related to ideas from the work on tracking drifting concepts in the computational learning
literature. The first models of concept drift were proposed around the same time by Helmbold and
Long [12] and Kuh et al. [16]. In both of these models, at each time $i$, an input point $x_i$ is drawn
from a fixed but unknown distribution $\mathcal{D}$ and labeled by a target function $f_i \in \mathcal{C}$. It is assumed
that the error of $f_i$ with respect to $f_{i-1}$ on $\mathcal{D}$ is less than a fixed value $\Delta$. Helmbold and Long
[12] showed that a simple algorithm that chooses a concept to (approximately) minimize error over
recent time steps achieves an average error of $\tilde{O}(\sqrt{\Delta d})$ where $d$ is the VC dimension of $\mathcal{C}$.³ More
general models of drift have also been proposed @, Q.

Let $f_i \in \mathcal{C}$ denote the ideal function on round $i$ of the evolution process. Following Helmbold
and Long [12], we make the assumption that for all $i$, $\mathrm{err}_{\mathcal{D}}(f_{i-1}, f_i) < \Delta$ for some value $\Delta$. This is
equivalent to assuming that $\mathrm{Perf}_{f_{i-1}}(f_i, \mathcal{D}) > 1 - 2\Delta$. Call a sequence of functions satisfying this
condition a $\Delta$-drifting sequence. We make no other assumptions on the sequence of ideal functions.
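The drift condition is a constraint on consecutive targets only, which the following sketch checks by brute force. It assumes the uniform distribution over $\{-1, 1\}^n$ and uses monotone conjunctions as a hypothetical target sequence; the function names are illustrative.

```python
from itertools import product

def conjunction(variables):
    """Monotone conjunction of the listed variables, with values in {-1, +1}."""
    return lambda x: 1 if all(x[i] == 1 for i in variables) else -1

def disagreement(f, g, n):
    """Exact err_D(f, g) when D is uniform over {-1, 1}^n (an assumption)."""
    points = list(product([-1, 1], repeat=n))
    return sum(f(x) != g(x) for x in points) / len(points)

def is_delta_drifting(targets, delta, n):
    """True if every consecutive pair of targets disagrees on less than delta."""
    return all(disagreement(a, b, n) < delta
               for a, b in zip(targets, targets[1:]))

n = 4
# Hypothetical drifting targets: each differs from its predecessor on 1/16 of the domain.
targets = [conjunction([0, 1, 2]),
           conjunction([0, 1, 2, 3]),
           conjunction([0, 1, 3])]
print(is_delta_drifting(targets, 0.1, n), is_delta_drifting(targets, 0.05, n))
```

Note that the first and last targets in such a sequence may be far apart even though each single step is small, which is why the definition places no bound on cumulative drift.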

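Definition 7 (Evolvability with Drifting Targets) For a concept class $\mathcal{C}$, distribution $\mathcal{D}$, and
evolution algorithm $\mathcal{E} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$, we say that $\mathcal{C}$ is evolvable with drifting targets over $\mathcal{D}$ by
$\mathcal{E}$ if there exist polynomials $g(n, 1/\epsilon)$ and $d(n, 1/\epsilon)$ such that for every $n \in \mathbb{N}$, $r_0 \in \mathcal{R}_n$, and $\epsilon > 0$,
for any $\Delta < 1/d(n, 1/\epsilon)$, and every $\Delta$-drifting sequence $f_1, f_2, \ldots$ (with $f_i \in \mathcal{C}_n$ for all $i$), if $r_0, r_1, \ldots$
is generated by $\mathcal{E}$ such that $r_i = M(f_{i-1}, \mathcal{D}, \mathcal{E}, \epsilon, r_{i-1})$, then for all $\ell > g(n, 1/\epsilon)$, with probability at
least $1 - \epsilon$, $\mathrm{Perf}_{f_\ell}(r_\ell, \mathcal{D}) > 1 - \epsilon$. We refer to $d(n, 1/\epsilon)$ as the drift polynomial.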

As in the basic definition, we say that the class $\mathcal{C}$ is evolvable with drifting targets over $\mathcal{D}$ if there
exists a valid evolution algorithm $\mathcal{E} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$ such that $\mathcal{C}$ is evolvable with drifting targets
over $\mathcal{D}$ by $\mathcal{E}$. The drift polynomial specifies how much drift the algorithm can tolerate.

Our first main technical result, Theorem 8, relates the idea of monotonicity described above
to drift resistance by showing that given a strictly beneficial neighborhood function for a class $\mathcal{C}$,
distribution $\mathcal{D}$, and representation class $\mathcal{R}$, one can construct a mutation algorithm $\mathcal{E}$ such that $\mathcal{C}$ is
evolvable with drifting targets over $\mathcal{D}$ by $\mathcal{E}$. The tolerance $t$ and sample size $s$ of $\mathcal{E}$ and the resulting
generation polynomial $g$ and drift polynomial $d$ directly depend only on the benefit polynomial $b$ as
described below. The proof is very similar to the proof of the first half of Lemma 5. Once again the
key idea is to show that it is possible to set the tolerance such that with high probability, Bene is
never empty and there is never a representation in Bene with performance too much worse than the
guaranteed beneficial mutation. This implies that the mutation algorithm is guaranteed to choose
a new representation with a significant increase in performance with respect to the previous target
$f_{i-1}$ at each round $i$ with high probability. As long as $f_{i-1}$ and $f_i$ are sufficiently close, the chosen
representation is also guaranteed to have good performance with respect to $f_i$.

Theorem 8 For any concept class $\mathcal{C}$, distribution $\mathcal{D}$, and representation class $\mathcal{R}$, if $\mathrm{Neigh}$ is a
strictly beneficial neighborhood function for $\mathcal{C}$, $\mathcal{D}$, and $\mathcal{R}$, then there exist valid functions $\mu$, $t$, and
$s$ such that $\mathcal{C}$ is evolvable with drifting targets over $\mathcal{D}$ by $\mathcal{E} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$. In particular, if
$\mathrm{Neigh}$ is strictly beneficial with benefit polynomial $b(n, 1/\epsilon)$, and $p(n, 1/\epsilon)$ is an arbitrary polynomial
upper bound on the size of $\mathrm{Neigh}(r, \epsilon)$, then $\mathcal{C}$ is evolvable with drifting targets over $\mathcal{D}$ with

• any distributions $\mu$ that satisfy $\mu(r, r', \epsilon) > 1/p(n, 1/\epsilon)$ for all $r \in \mathcal{R}_n$, $\epsilon$, and $r' \in \mathrm{Neigh}(r, \epsilon)$,

• tolerance function $t(r, \epsilon) = 1/(2b(n, 1/\epsilon))$ for all $r \in \mathcal{R}_n$,

• any generation polynomial $g(n, 1/\epsilon) > 16\,b(n, 1/\epsilon)$,

• any sample size $s(n, 1/\epsilon) > 128\,(b(n, 1/\epsilon))^2 \ln(2 p(n, 1/\epsilon) g(n, 1/\epsilon)/\epsilon)$, and

• any drift polynomial $d(n, 1/\epsilon) > 16\,b(n, 1/\epsilon)$, which allows drift $\Delta < 1/(16\,b(n, 1/\epsilon))$.
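The bounds in Theorem 8 are all explicit functions of the benefit polynomial, so they can be evaluated directly for concrete values. The sketch below plugs in numbers for $b(n, 1/\epsilon)$ and $p(n, 1/\epsilon)$; the function name, the dictionary keys, and the example inputs are assumptions made for illustration, and the returned values are the thresholds that the theorem's strict inequalities must exceed (or, for drift, stay below).

```python
import math

def drift_parameters(b, p, eps):
    """Evaluate the thresholds of Theorem 8.

    b   -- value of the benefit polynomial b(n, 1/eps)
    p   -- value of the neighborhood-size bound p(n, 1/eps)
    eps -- accuracy parameter
    """
    generations = 16 * b                      # threshold for g(n, 1/eps)
    sample_bound = 128 * b**2 * math.log(2 * p * generations / eps)
    return {
        "tolerance": 1 / (2 * b),             # t(r, eps)
        "generations": generations,
        "sample_size": math.ceil(sample_bound),
        "max_drift": 1 / (16 * b),            # drift must stay below this
    }

# Hypothetical values: b = 10, p = 20, eps = 0.1.
params = drift_parameters(b=10, p=20, eps=0.1)
print(params)
```

The sample size grows quadratically in $b$ while the tolerable drift shrinks only linearly, so a neighborhood with a smaller benefit polynomial improves every parameter at once.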

In Sections 6 and 7, which can be read independently of Section 5, we appeal to this theorem in
order to prove that some common concept classes are evolvable with drifting targets with relatively
large values of $\Delta$. Using Lemma 5, we also obtain the following corollary.

Corollary 9 If a concept class $\mathcal{C}$ is strictly monotonically evolvable over $\mathcal{D}$, then $\mathcal{C}$ is evolvable with
drifting targets over $\mathcal{D}$.

5 Robustness Results 

Feldman [9] proved that the original model of evolvability is equivalent to a restriction of the
statistical query model of learning [15] known as learning by correlational statistical queries (CSQ) @.
We extend Feldman's analysis to show that CSQ learning is also equivalent to both evolvability with
drifting targets and quasi-monotonic evolvability, and so the notion of evolvability is robust to these
changes in definition. We begin by briefly reviewing the CSQ model.



³ Throughout the paper, we use the notation $\tilde{O}$ to suppress logarithmic factors.



5.1 Learning from Correlational Statistical Queries 

The statistical query (SQ) model was introduced by Kearns [15] and has been widely studied due
to its connections to learning with noise [11, |j]. As in the PAC model, the goal of an SQ learner is
to produce a hypothesis $h$ that approximates the behavior of a target function $f$ with respect to
a fixed but unknown distribution $\mathcal{D}$. Unlike in the PAC model, the learner is not given direct access
to labeled examples $(x, f(x))$, but is instead given access to a statistical query oracle. The learner
submits queries of the form $(\psi, \tau)$ to the oracle, where $\psi : \mathcal{X} \times \{-1, 1\} \to [-1, 1]$ is a query function
and $\tau \in [0, 1]$ is a tolerance parameter. The oracle responds to each query with any value $v$ such that
$|\mathbb{E}_{x \sim \mathcal{D}}[\psi(x, f(x))] - v| < \tau$. An algorithm is said to efficiently learn a class $\mathcal{C}$ in the SQ model if for
all $n \in \mathbb{N}$, $\epsilon > 0$, and $f \in \mathcal{C}_n$, and every distribution $\mathcal{D}_n$ over $\mathcal{X}_n$, the algorithm, given access to $\epsilon$ and
the SQ oracle for $f$ and $\mathcal{D}_n$, outputs a polynomially computable hypothesis $h$ in polynomial time
such that $\mathrm{err}(f, h) < \epsilon$. Furthermore it is required that each query $(\psi, \tau)$ made by the algorithm
can be evaluated in polynomial time given access to $f$ and $\mathcal{D}_n$. It is known that any class efficiently
learnable in the SQ model is efficiently learnable in the PAC model with label noise [15].

A query $(\psi, \tau)$ is called a correlational statistical query (CSQ) Q if $\psi(x, f(x)) = \phi(x) f(x)$ for
some function $\phi : \mathcal{X} \to [-1, 1]$. An algorithm $A$ is said to efficiently learn a class $\mathcal{C}$ in the CSQ
model if $A$ efficiently learns $\mathcal{C}$ in the SQ model using only correlational statistical queries.

It is useful to consider one additional type of query, the CSQ> query Q. A CSQ> query is
specified by a triple $(\phi, \theta, \tau)$, where $\phi : \mathcal{X} \to [-1, 1]$ is a query function, $\theta$ is a threshold, and
$\tau \in [0, 1]$ is a tolerance parameter. When presented with such a query, a CSQ> oracle for target $f$
and distribution $\mathcal{D}$ returns 1 if $\mathbb{E}_{x \sim \mathcal{D}}[\phi(x) f(x)] > \theta + \tau$, 0 if $\mathbb{E}_{x \sim \mathcal{D}}[\phi(x) f(x)] < \theta - \tau$, and an arbitrary
value of either 1 or 0 otherwise. Feldman P] showed that if there exists an algorithm for learning $\mathcal{C}$
over $\mathcal{D}$ that makes CSQs, then there exists an algorithm for learning $\mathcal{C}$ over $\mathcal{D}$ using CSQ>s of the
form $(\phi, \theta, \tau)$ where $\theta > \tau$ for all queries. Furthermore the number of queries made by this algorithm
is at most $O(\log(1/\tau))$ times the number of queries made by the original CSQ algorithm.
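The behavior of a CSQ> oracle can be sketched as a closure over the target and distribution. This is a minimal illustration, assuming the uniform distribution over $\{-1, 1\}^n$ so that the correlation can be computed exactly; the name `make_csq_gt_oracle` and the `adversary` callback, which models the oracle's freedom inside the tolerance band, are hypothetical.

```python
from itertools import product

def make_csq_gt_oracle(f, n, adversary=lambda: 0):
    """Build a CSQ> oracle for target f under the uniform distribution on {-1, 1}^n.

    `adversary` supplies the arbitrary answer (0 or 1) whenever the true
    correlation lies within tau of the threshold theta.
    """
    points = list(product([-1, 1], repeat=n))

    def oracle(phi, theta, tau):
        # Exact correlation E[phi(x) f(x)] over the uniform distribution.
        corr = sum(phi(x) * f(x) for x in points) / len(points)
        if corr > theta + tau:
            return 1
        if corr < theta - tau:
            return 0
        return adversary()

    return oracle

# Hypothetical target: the first coordinate.
oracle = make_csq_gt_oracle(lambda x: x[0], n=2)
print(oracle(lambda x: x[0], 0.5, 0.1), oracle(lambda x: x[1], 0.5, 0.1))
```

A learner that is correct against this oracle must produce the same guarantees for every choice of `adversary`, since the answers inside the band carry no information.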

5.2 Overview of the Reduction 

The construction we present uses Feldman's simulation [9] repeatedly. Fix a concept class $\mathcal{C}$ and a
distribution $\mathcal{D}$ such that $\mathcal{C}$ is learnable over $\mathcal{D}$ in the CSQ model. As mentioned above, this implies
that there exists a CSQ> algorithm $A$ for learning $\mathcal{C}$ over $\mathcal{D}$. Let $H$ be the class of hypotheses from
which the output of $A$ is chosen. In the analysis that follows, we restrict our attention to the case
in which $A$ is deterministic. However, the extension of our analysis to randomized algorithms is
straightforward using Feldman's ideas (see Lemma 4.7 in his paper Q).

First, we present a high level outline of our reduction. Throughout this section we will use
randomized Boolean functions. If $\psi : \mathcal{X} \to [-1, 1]$ is a real-valued function, let $\Psi$ denote the
randomized Boolean function such that for every $x$, $\mathbb{E}[\Psi(x)] = \psi(x)$. It can be easily verified that
for any function $\phi(x)$, $\mathbb{E}_{x \sim \mathcal{D}}[\phi(x) \Psi(x)] = \mathbb{E}_{x \sim \mathcal{D}}[\phi(x) \psi(x)]$. For the rest of this section, we will abuse
notation and simply write real-valued functions in place of the corresponding randomized Boolean
functions.
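One standard way to realize a randomized Boolean function with a prescribed expectation is to output $+1$ with probability $(1 + \psi(x))/2$ and $-1$ otherwise. The sketch below shows this construction; the function name `randomize` and the explicit random generator argument are assumptions for illustration, and the text does not specify which construction is intended.

```python
import random

def randomize(psi, rng):
    """Return a randomized {-1, +1}-valued function Psi with E[Psi(x)] = psi(x).

    psi must take values in [-1, 1]; rng is a random.Random instance.
    """
    def Psi(x):
        # P(+1) - P(-1) = (1 + psi)/2 - (1 - psi)/2 = psi(x).
        return 1 if rng.random() < (1 + psi(x)) / 2 else -1
    return Psi

rng = random.Random(0)
Psi = randomize(lambda x: 0.5, rng)          # hypothetical constant function
draws = [Psi(None) for _ in range(20000)]
print(sum(draws) / len(draws))               # empirical mean, close to 0.5
```

Since the randomness of $\Psi$ is independent of $x$, taking the expectation over both the draw of $x$ and the internal coin gives the correlation identity stated in the text.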

Our representation is of the form $r = (1 - \epsilon/2) h + (\epsilon/2) \Phi$. Here $h$ is a hypothesis from $H$ and $\Phi$ is a
function that encodes the state of the CSQ> algorithm that is being simulated. Feldman's simulation
only uses the second part. Our simulation runs in perpetuity, restarting Feldman's simulation each
time it has completed. Since the target functions are drifting over time, if $h$ has high performance
with respect to the current target function, it will retain the performance for some time steps in
the future, but not forever. During this time, Feldman's simulation on the $\Phi$ part produces a new
hypothesis $h'$ which has high performance at the time this simulation is completed. At this time,
we will transition to a representation $r' = (1 - \epsilon/2) h' + (\epsilon/2) \Phi$, where $\Phi$ is reset to start Feldman's
simulation anew. Thus, although the target drifts, our simulation will continuously run Feldman's
simulation to find a hypothesis that has a high performance with respect to the current target.

The rest of Section 5 details the reduction. First, we show how a single run of $\mathcal{A}$ is
simulated, which is essentially Feldman's reduction with minor modifications. Then we discuss how
to restart this simulation once it has completed. This requires the addition of certain
intermediate states to keep the reduction feasible in the evolution model. We also show that our
reduction can be made quasi-monotonic. Finally, we show how all this can be done using a
representation class that is independent of $\epsilon$, as is required. This last step is shown in
the appendix.

5.3 Construction of the Evolutionary Algorithm 

We describe the construction of our evolutionary algorithm $\mathcal{E}$. Let
$\tau = \tau(n, 1/\epsilon)$ be a polynomial lower bound on the tolerance of the queries made by
$\mathcal{A}$ when run with accuracy parameter $\epsilon/4$. Without loss of generality, we may
assume all queries are made with this tolerance. Let $q = q(n, 1/\epsilon)$ be a polynomial upper
bound on the number of queries made by $\mathcal{A}$, and assume that $\mathcal{A}$ makes exactly
$q$ queries (if not, redundant queries can be added). Here, we allow our representation class to be
dependent on $\epsilon$. However, this restriction may be removed (cf. Appendix A.7.1). In the
remainder of this section we drop the subscripts $n$ and $\epsilon$, except where there is a
possibility of confusion.

Following Feldman's notation, let $z$ denote a bit string of length $q$ which records the oracle
responses to the queries made by $\mathcal{A}$; that is, the $i$th bit of $z$ is 1 if and only if
the answer to the $i$th query is 1. Let $|z|$ denote the length of $z$, $z^i$ the prefix of $z$ of
length $i$, and $z_i$ the $i$th bit of $z$. Since $\mathcal{A}$ is deterministic, the $i$th query
made by $\mathcal{A}$ depends only on responses to the previous $i - 1$ queries. We denote this
query by $(\phi_{z^{i-1}}, \theta_{z^{i-1}}, \tau)$, with $\theta_{z^{i-1}} \ge \tau$, as discussed
in Section 5.1. Let $h_z$ denote the final hypothesis output by $\mathcal{A}$ given query responses
$z$. Since we have chosen to simulate $\mathcal{A}$ with accuracy parameter $\epsilon/4$, $h_z$ is
guaranteed to satisfy $\mathrm{Perf}_f(h_z, \mathcal{D}) \ge 1 - \epsilon/4$ for any function $f$
for which the query responses in $z$ are valid. Finally, let $\sigma$ denote the empty string.

For every $i \in \{1, \ldots, q\}$ and $z \in \{0, 1\}^i$, we define
$\Phi_z(x) = (1/q) \sum_{j=1}^{i} I(z_j = 1)\, \phi_{z^{j-1}}(x)$, where $I$ is an indicator
function that is 1 if its input is true and 0 otherwise. For any $h \in \mathcal{H}$, define
$r_\epsilon[h, z] = (1 - \epsilon/2)h(x) + (\epsilon/2)\Phi_z(x)$. Recall that each of these
real-valued functions can be treated as a randomized Boolean function as required by the evolution
model. The performance of this function, which we use as our basic representation, is mainly
determined by the performance of $h$, but by setting the tolerance parameter low enough, the
$\Phi_z$ part can learn useful information about the (drifting) targets by simulating $\mathcal{A}$.
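As an illustration of how the two parts of the representation combine, the following sketch evaluates $\Phi_z(x)$ and $r_\epsilon[h, z](x)$ for a bit string of recorded answers. It is only a toy under stated assumptions: `query_for_prefix` is a hypothetical stand-in for the deterministic algorithm's rule mapping the answers so far to its next query function, and answers are stored as a string of `"0"`/`"1"` characters.

```python
def phi_z(z, query_for_prefix, q, x):
    """Phi_z(x) = (1/q) * sum over j with z_j = 1 of phi_{z^{j-1}}(x).

    z is a string of '0'/'1' answers; query_for_prefix(prefix) is a
    hypothetical helper returning the query function asked after `prefix`.
    """
    total = 0.0
    for j in range(1, len(z) + 1):
        if z[j - 1] == "1":
            total += query_for_prefix(z[: j - 1])(x)
    return total / q


def representation(h, z, query_for_prefix, q, eps, x):
    """r_eps[h, z](x) = (1 - eps/2) * h(x) + (eps/2) * Phi_z(x)."""
    return (1 - eps / 2) * h(x) + (eps / 2) * phi_z(z, query_for_prefix, q, x)
```

With the empty string, $\Phi_z$ is identically zero, so the value is $(1 - \epsilon/2)h(x)$; each recorded answer of 1 adds at most $\epsilon/(2q)$ in absolute value, which is why the $h$ term dominates the performance.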

Let $R_\epsilon = \{r_\epsilon[h, z] \mid h \in \mathcal{H},\ 0 \le |z| \le q - 1\}$. The
representations in $R_\epsilon$ will be used for simulating one round of $\mathcal{A}$. To reach a
state where we can restart the simulation, we will need to add intermediate representations. These
are defined below.

Let $t_u(n, 1/\epsilon)$ be an upper bound on $\epsilon\theta_{z^i}/(8q)$ for all $i$ and $z^i$.
(This will be a polynomial upper bound on all tolerances $t$ that we define below.) Assume for
simplicity that $K = 2/t_u(n, 1/\epsilon)$ is an integer. Let $w_0 = r_\epsilon[h, z]$, for some
$h \in \mathcal{H}$ and $|z| = q$ ($w_0$ depends on $h$ and $z$, but to keep notation simple we
will avoid subscripts). For $k = 1, \ldots, K$, define $w_k = (1 - k(t_u(n, 1/\epsilon)/2))w_0$.
Notice that $w_K = 0$, where $0$ is a function that can be realized by a randomized function that
ignores its input and predicts $+1$ or $-1$ randomly. Let
$W_\epsilon = \{w_i \mid w_0 = r_\epsilon[h, z],\ h \in \mathcal{H},\ |z| = q,\ i \in \{0, \ldots, K\}\}$.
Finally define $\mathcal{R}_\epsilon = R_\epsilon \cup W_\epsilon$. For every representation
$r_\epsilon[h, z] \in R_\epsilon$, we set

• $\mathrm{Neigh}(r_\epsilon[h, z], \epsilon) = \{r_\epsilon[h, z], r_\epsilon[h, z0], r_\epsilon[h, z1]\}$,

• $\mu(r_\epsilon[h, z], r_\epsilon[h, z], \epsilon) = \eta$ and $\mu(r_\epsilon[h, z], r_\epsilon[h, z0], \epsilon) = \mu(r_\epsilon[h, z], r_\epsilon[h, z1], \epsilon) = (1 - \eta)/2$,

• $t(r_\epsilon[h, z], \epsilon) = \epsilon\theta_z/(8q)$.

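For the remaining representations $w_k \in W_\epsilon$, with $w_0 = r_\epsilon[h, z]$, we set

• $\mathrm{Neigh}(w_K, \epsilon) = \{w_K, r_\epsilon[0, \sigma]\}$ and $\mathrm{Neigh}(w_k, \epsilon) = \{w_k, w_{k+1}, r_\epsilon[h_{z,\epsilon}, \sigma]\}$ for all $k < K$,

• $\mu(w_K, w_K, \epsilon) = \eta$ and $\mu(w_K, r_\epsilon[0, \sigma], \epsilon) = 1 - \eta$, and $\mu(w_k, w_k, \epsilon) = \eta^2$, $\mu(w_k, w_{k+1}, \epsilon) = \eta - \eta^2$, and
$\mu(w_k, r_\epsilon[h_{z,\epsilon}, \sigma], \epsilon) = 1 - \eta$ for all $k < K$,

• $t(w_k, \epsilon) = t_u(n, 1/\epsilon)$.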
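A quick way to sanity-check the mutation distributions listed above is to confirm that each one sums to 1. The sketch below encodes the probabilities as read from the construction ($\eta$, $(1 - \eta)/2$, $\eta^2$, $\eta - \eta^2$, $1 - \eta$); the dictionary keys are illustrative labels, not notation from the paper.

```python
def mutation_probs_simulation(eta):
    """Mutation probabilities from r_eps[h, z] with |z| <= q - 1."""
    return {
        "stay": eta,
        "append_0": (1 - eta) / 2,
        "append_1": (1 - eta) / 2,
    }


def mutation_probs_intermediate(eta, k, K):
    """Mutation probabilities from the intermediate representation w_k."""
    if k == K:
        # w_K can only stay put or restart from the zero hypothesis.
        return {"stay": eta, "restart_from_zero": 1 - eta}
    return {
        "stay": eta ** 2,
        "next_w": eta - eta ** 2,
        "restart_from_h_z": 1 - eta,
    }
```

In each case the restart or extension moves carry almost all of the probability mass, so a mutation with relative probability below $2\eta$ is the unlikely event.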

Finally, let $\eta = \epsilon/(4q + 2K)$, $\tau' = \min\{(\epsilon\tau)/(2q),\ t_u(n, 1/\epsilon)/8\}$,
and $s = \frac{1}{2(\tau')^2} \log((6q + 3K)/\epsilon)$. Let
$\mathcal{E} = (\mathcal{R}_\epsilon, \mathrm{Neigh}, \mu, t, s)$ with components defined as above.
We show that $\mathcal{E}$ evolves $\mathcal{C}$ over $\mathcal{D}$ tolerating drift of
$\Delta = (\epsilon\tau)/(4q + 2K + 2)$. This value of drift, while small, is an inverse polynomial
in $n$ and $1/\epsilon$ as required. The point to note is that the evolutionary algorithm runs
perpetually, while still maintaining high performance on any given round with high probability.
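To make the parameter settings concrete, the sketch below computes $K$, $\eta$, $\tau'$, $s$, and the tolerated drift $\Delta$ from given values of $\epsilon$, $\tau$, $q$, and $t_u$. It assumes the logarithm is natural (the text does not specify a base) and treats the inputs as plain numbers rather than polynomials in $n$ and $1/\epsilon$; the function name and dictionary keys are illustrative.

```python
import math


def evolution_parameters(eps, tau, q, t_u):
    """Parameter settings of the construction for concrete numeric inputs.

    Assumes a natural logarithm in the sample-size formula.
    """
    K = 2 / t_u                                   # number of intermediate steps
    eta = eps / (4 * q + 2 * K)                   # probability of unlikely moves
    tau_prime = min(eps * tau / (2 * q), t_u / 8)  # estimation accuracy
    s = math.log((6 * q + 3 * K) / eps) / (2 * tau_prime ** 2)  # sample size
    delta = eps * tau / (4 * q + 2 * K + 2)       # tolerated drift rate
    return {"K": K, "eta": eta, "tau_prime": tau_prime, "s": s, "delta": delta}
```

For example, with $\epsilon = 0.1$, $\tau = 0.05$, $q = 10$, and $t_u = 0.01$, the tolerated drift is $0.005/442$, which is below both $\tau/(2q)$ and $t_u/4$, the thresholds that appear as hypotheses in the lemmas of this section.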

For any representation $r$, we denote by LPE the union of the low probability events that some
estimates of performance are not within $\tau'$ of their true value, or that a mutation with
relative probability less than $2\eta$ (either in Bene or Neut) is selected over other mutations.

5.4 Simulating the CSQ> Algorithm for Drifting Targets 

We now show that it is possible to simulate a CSQ$_>$ algorithm using an evolution algorithm
$\mathcal{E}$ even when the target is drifting. However, if we simulate a query
$(\phi, \theta, \tau)$ on round $i$, there is no guarantee that the answer to this query will
remain valid in future rounds. The following lemma shows that by lowering the tolerance of the
simulated query below the tolerance that is actually required by the CSQ$_>$ algorithm, we are able
to generate a sequence of query answers that remain valid over many rounds. Specifically, it shows
that if $v$ is a valid response for the query $(\phi, \theta, \tau/2)$ with respect to $f_i$, then
$v$ is also a valid response for the query $(\phi, \theta, \tau)$ with respect to $f_j$ for any
$j \in [i - \tau/(2\Delta),\ i + \tau/(2\Delta)]$.

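Lemma 10 Let $f_1, f_2, \ldots$ be a $\Delta$-drifting sequence with respect to the distribution
$\mathcal{D}$ over $X$. For any tolerance $\tau$, any threshold $\theta$, any indices $i$ and $j$
such that $|i - j| \le \tau/(2\Delta)$, and any function $\phi : X \to [-1, 1]$, if
$E_{x \sim \mathcal{D}}[\phi(x) f_i(x)] \ge \theta + \tau$, then
$E_{x \sim \mathcal{D}}[\phi(x) f_j(x)] \ge \theta + \tau/2$. Similarly, if
$E_{x \sim \mathcal{D}}[\phi(x) f_i(x)] \le \theta - \tau$, then
$E_{x \sim \mathcal{D}}[\phi(x) f_j(x)] \le \theta - \tau/2$.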



We say that a string $z$ is consistent with a target function $f$ if for all $1 \le i \le |z|$,
$z_i$ is a valid response to the query $(\phi_{z^{i-1}}, \theta_{z^{i-1}}, \tau)$ with respect to
$f$. Suppose that the algorithm $\mathcal{E}$ starts with representation $r_0 = r_\epsilon[h, \sigma]$.
(Recall that $\sigma$ denotes the empty string.) The following lemma shows that after $q$ time
steps, with high probability it will reach a representation $r_\epsilon[h, z]$ where $|z| = q$ and
$z$ is consistent with the target function $f_q$, implying that $z$ is a proper simulation of
$\mathcal{A}$ on $f_q$.

Lemma 11 If $\Delta \le \tau/(2q)$, then for any $\Delta$-drifting sequence $f_0, f_1, \ldots, f_q$,
if $r_0, r_1, \ldots, r_q$ is the sequence of representations of $\mathcal{E}$ starting at
$r_0 = r_\epsilon[h, \sigma]$, and if the LPE does not occur for $q$ rounds, then
$r_q = r_\epsilon[h, z]$ where $|z| = q$ and $z$ is consistent with $f_q$.

The proof uses the following ideas: If the LPE does not occur, there are no mutations of the
form $r \to r$, so the length of $z$ increases by 1 every round, and also all estimates of
performance are within $\tau'$ of their true value. When this is the case, and after observing that
$r_\epsilon[h, z^i 0]$ is always neutral, it is possible to show that for any round $i$, (i) if
$r_\epsilon[h, z^i 1]$ is beneficial, then 1 is a valid answer to the $i$th query with respect to
$f_i$, (ii) if $r_\epsilon[h, z^i 1]$ is deleterious, then 0 is a valid answer for the $i$th query
with respect to $f_i$, and (iii) if $r_\epsilon[h, z^i 1]$ is neutral, then both 0 and 1 are valid
answers to the $i$th query. This implies that $z_{i+1}$ is always a valid answer to the $i$th query
with respect to $f_i$, and by Lemma 10, with respect to $f_q$.

5.5 Restarting the Simulation 

We now discuss how to restart Feldman's simulation once it completes. Suppose we are in a
representation of the form $r_\epsilon[h, z]$, where $|z| = q$, and $z$ is consistent with the
current target function $f$. Then if $h_z$ is the hypothesis output by $\mathcal{A}$ using query
responses in $z$, we are guaranteed that (with high probability)
$\mathrm{Perf}_f(h_z, \mathcal{D}) \ge 1 - \epsilon/4$. At this point, we would like the algorithm
to choose a new representation $r_\epsilon[h_z, \sigma]$, where $\sigma$ is the empty string. The
intuition behind this move is as follows. The performance of $r_\epsilon[h_z, \sigma]$ is
guaranteed to be high (and to remain high for many generations) because much of the weight is on
the $h_z$ term. Thus we can use the second term ($\Phi_\sigma$) to restart the learning process.
After $q$ more time steps have passed, it may be the case that the performance of $h_z$ is no
longer as high with respect to the new target, but the simulated algorithm will have already found
a different hypothesis that does have high performance with respect to this new target.

There is one tricky aspect of this approach. In some circumstances, we may need to restart the
simulation by moving from $r_\epsilon[h, z]$ to $r_\epsilon[h_z, \sigma]$ even though $z$ is not
consistent with $f$. This situation can arise for two reasons. First, we might be near the
beginning of the evolution process when $\mathcal{E}$ has not had enough generations to correctly
determine the query responses (the starting state may be $r_\epsilon[h, z_0]$ where $z_0$ has wrong
answers). Second, there is some small probability of failure on any given round and we would like
the evolutionary algorithm to recover from such failures smoothly. In either case, to handle the
situation in which $h_z$ may have performance below zero (or very close), we will also allow
$r_\epsilon[h, z]$ to mutate to $r_\epsilon[0, \sigma]$.

The required changes from $r_\epsilon[h, z]$ to either $r_\epsilon[h_z, \sigma]$ or
$r_\epsilon[0, \sigma]$ described above may be deleterious. To handle this, we employ a technique
of Feldman [9], where we first decrease the performance gradually (through neutral mutations)
until these mutations are no longer deleterious. The representations defined in $W_\epsilon$
achieve this. The claim is that starting from any representation of the form $w_k$, we reach
either $r_\epsilon[h_z, \sigma]$ or $r_\epsilon[0, \sigma]$ in at most $K - k + 1$ steps, with high
probability. Furthermore, since the probability of moving to $r_\epsilon[h_z, \sigma]$ is very
high, this representation will be reached if it is ever a neutral mutation (i.e., the LPE does not
happen). Thus, the performance always stays above the performance of $r_\epsilon[h_z, \sigma]$.
Lemma 12 formalizes this claim.

Lemma 12 If $\Delta \le t_u(n, 1/\epsilon)/4$, then for any $\Delta$-drifting sequence
$f_0, f_1, \ldots, f_q$, if $r_0, r_1, \ldots, r_q$ is the sequence of mutations of $\mathcal{E}$
starting at $r_0 = w_k$, then if the LPE does not happen at any time step, there exists a
$j \le K - k + 1$ such that $r_j = r_\epsilon[h_{z,\epsilon}, \sigma]$ or
$r_j = r_\epsilon[0, \sigma]$. Furthermore, for all $1 \le i < j$,
$\mathrm{Perf}_{f_i}(r_i, \mathcal{D}) \ge \mathrm{Perf}_{f_i}(r_\epsilon[h_{z,\epsilon}, \sigma], \mathcal{D})$.

5.6 Equivalence to Evolvability with Drifting Targets 

Combining these results, we prove the equivalence between evolvability and evolvability with
drifting targets starting from any representation in $\mathcal{R}_\epsilon$. The proof we give here
uses the representation class $\mathcal{R}_\epsilon$ and therefore assumes that the value of
$\epsilon$ is known. For the needed generalization to the case where
$\mathcal{R} = \cup_\epsilon \mathcal{R}_\epsilon$, Feldman's backsliding trick [9] can be used to
first reach a representation with zero performance, and then move to a representation in
$\mathcal{R}_\epsilon$. Theorem 13 shows that every concept class that is learnable using CSQs (and
thus every class that is evolvable) is evolvable with drifting targets.



Theorem 13 If $\mathcal{C}$ is evolvable over distribution $\mathcal{D}$, then $\mathcal{C}$ is evolvable with drifting targets over $\mathcal{D}$.

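Proof: Let $\mathcal{A}$ be a CSQ$_>$ algorithm for learning $\mathcal{C}$ over $\mathcal{D}$ with
accuracy $\epsilon/4$. $\mathcal{A}$ makes $q = q(n, 1/\epsilon)$ queries of tolerance $\tau$ and
outputs $h$ satisfying $\mathrm{Perf}_f(h, \mathcal{D}) \ge 1 - \epsilon/4$. Let $\mathcal{E}$ be
the evolutionary algorithm derived from $\mathcal{A}$ as described in Section 5.3. Recall that
$K = 2/t_u(n, 1/\epsilon)$, and let $g = 2q + K + 1$. We show that starting from an arbitrary
representation $r_0 \in \mathcal{R}_\epsilon$, with probability at least $1 - \epsilon$,
$\mathrm{Perf}_{f_g}(r_g, \mathcal{D}) \ge 1 - \epsilon$. This is sufficient to show that for all
$\ell \ge g$, with probability at least $1 - \epsilon$,
$\mathrm{Perf}_{f_\ell}(r_\ell, \mathcal{D}) \ge 1 - \epsilon$, since we can consider the run of
$\mathcal{E}$ starting from $r_{\ell - g}$.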

With the setting of parameters as described in Section 5.3, with probability at least
$1 - \epsilon$, the LPE does not occur for $g$ time steps, i.e., all estimates are within
$\tau' = \min\{(\tau\epsilon)/(2q),\ t_u(n, 1/\epsilon)/8\}$ of their true value and unlikely
mutations (those with relative probabilities less than $2\eta$) are not chosen. Thus, we can apply
the results of Lemmas 11 and 12. We assume that this is the case for the rest of the proof. When
$\Delta = (\epsilon\tau)/(4q + 2K + 2)$, the assumptions of Lemmas 11 and 12 hold and we can apply
them.

First, we argue that starting from an arbitrary representation, in at most $q + K$ steps, we will
have reached a representation of the form $r_\epsilon[h, \sigma]$, for some $h \in \mathcal{H}$.
If the start representation is $r_\epsilon[h, z]$ for $|z| \le q - 1$, then in at most $q - 1$
steps we reach a representation of the form $r_\epsilon[h, z']$ with $|z'| = q$, in which case by
Lemma 12 the algorithm will transition to a representation of the form $r_\epsilon[h, \sigma]$ in
at most $K + 1$ additional steps. Alternately, if the start representation is $w_k$ for
$k \in \{0, \ldots, K\}$ as defined in Section 5.3, then by Lemma 12 we reach a representation of
the form $r_\epsilon[h, \sigma]$ in at most $K + 1$ steps.

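Let $m$ be the time step when $\mathcal{E}$ first reaches the representation of the form
$r_\epsilon[h, \sigma]$. Then using Lemma 11, $r_{m+q} = r_\epsilon[h, z^*]$, where $z^*$ is
consistent with $f_{m+q}$. Let $h^* = h_{z^*,\epsilon}$ be the hypothesis output by the simulated
run of $\mathcal{A}$. Then $\mathrm{Perf}_{f_{m+q}}(h^*, \mathcal{D}) \ge 1 - \epsilon/4$, and
hence $\mathrm{Perf}_{f_{m+q}}(r_\epsilon[h^*, \sigma], \mathcal{D}) \ge 1 - 3\epsilon/4$. For the
value of $\Delta$ we are using, for all $i \le g$,
$\mathrm{Perf}_{f_{m+q+i}}(r_\epsilon[h^*, \sigma], \mathcal{D}) \ge 1 - \epsilon$.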

From such a representation, when all estimates of performance are within $\tau'$ of their true
value and unlikely mutations (those with relative probability less than $2\eta$) do not occur, the
performance will remain above $1 - \epsilon$. By Lemma 12, the algorithm will move from
$r_{m+q} = r_\epsilon[h, z^*]$ to $r_\epsilon[h^*, \sigma]$ in at most $K + 1$ steps, and during
these time steps for any time step $i$ it holds that
$\mathrm{Perf}_{f_i}(r_i, \mathcal{D}) \ge \mathrm{Perf}_{f_i}(r_\epsilon[h^*, \sigma], \mathcal{D})$.
Once $r_\epsilon[h^*, \sigma]$ is reached, for $q$ steps the representations will be of the form
$r_\epsilon[h^*, z]$. For any such time step $i$,
$\mathrm{Perf}_{f_i}(r_i, \mathcal{D}) \ge \mathrm{Perf}_{f_i}(r_\epsilon[h^*, \sigma], \mathcal{D})$.
This is because if the answers in $z$ are correct (and they will be since the LPE does not happen
at any time step), the term $\Phi_z$ is made up of only those functions $\phi_{z^{j-1}}$ for which
$z_j = 1$, which are those for which $\phi_{z^{j-1}}$ has a correlation greater than
$\theta_{z^{j-1}} - \tau \ge 0$ with the target $f_i$ (using Lemma 10). Since, as observed above,
the performance of $r_\epsilon[h^*, \sigma]$ does not degrade below $1 - \epsilon$ in the time
horizon we are interested in,
$\mathrm{Perf}_{f_i}(r_i, \mathcal{D}) \ge \mathrm{Perf}_{f_i}(r_\epsilon[h^*, \sigma], \mathcal{D}) \ge 1 - \epsilon$. ■

5.7 Equivalence to Quasi-Monotonic Evolution 

Finally, we show that all evolvable classes are also evolvable quasi-monotonically. In the proof of
Theorem 13, we showed that for all $\ell \ge g = 2q + K + 1$, with high probability
$\mathrm{Perf}_{f_\ell}(r_\ell, \mathcal{D}) \ge 1 - \epsilon$, so quasi-monotonicity is satisfied
trivially. Thus we only need to show quasi-monotonicity for the first $g$ steps. We will use the
same construction as defined in Section 5.3, with modifications. However, this assumes that the
representation knows $\epsilon$, since now the trick of having the performance slide back to zero
would violate quasi-monotonicity. To make the representation class independent of $\epsilon$, a
more complex construction is needed. Details can be found in the appendix.

Theorem 14 If $\mathcal{C}$ is evolvable over distribution $\mathcal{D}$, then $\mathcal{C}$ is quasi-monotonically evolvable over $\mathcal{D}$
with drifting targets.

6 Evolving Hyperplanes with Drifting Targets

In this section, we present two alternative algorithms for evolving n-dimensional hyperplanes with 
drifting targets. The first algorithm, which generates the neighbors of a hyperplane by rotating 
it a small amount in one of 2(n — 1) directions, tolerates drift on the order of e/n, but only over 
spherically symmetric distributions. The second algorithm, which generates the neighbors of a 
hyperplane by shifting single components of its normal vector, tolerates a smaller drift, but works 
when the distribution is an unknown product normal distribution. To our knowledge, these are the 
first positive results on evolving hyperplanes in the computational model of evolution. 

Formally, let $\mathcal{C}_n$ be the class of all $n$-dimensional homogeneous linear separators.*
For notational convenience, we reference each linear separator in $\mathcal{C}_n$ by the
hyperplane's $n$-dimensional unit length normal vector $f \in \mathbb{R}^n$. For every
$f \in \mathcal{C}_n$ and $x \in \mathbb{R}^n$, we then have that $f(x) = 1$ if $f \cdot x \ge 0$,
and $f(x) = -1$ otherwise. The evolution algorithms we consider in this section use a
representation class $\mathcal{R}_n$ also consisting of $n$-dimensional unit vectors, where
$r \in \mathcal{R}_n$ is the normal vector of the hyperplane it represents.† Then
$\mathcal{R}_n = \{r \mid \|r\|_2 = 1\}$. We describe the two algorithms in turn.

* A homogeneous linear separator is one that passes through the origin.

6.1 An Evolution Algorithm Based on Rotations 

For the rotation-based algorithm, we define the neighborhood function of $r \in \mathcal{R}_n$ as
follows. Let $\{u^1, \ldots, u^n\}$ be an orthonormal basis for $\mathbb{R}^n$. This orthonormal
basis can be chosen arbitrarily (and potentially randomly) as long as $u^1 = r$. Then

\[ \mathrm{Neigh}(r, \epsilon) = r \cup \{ r' \mid r' = \cos(\epsilon/(\pi\sqrt{n}))\, r \pm \sin(\epsilon/(\pi\sqrt{n}))\, u^i,\ i \in \{2, \ldots, n\} \} \qquad (1) \]

In other words, each $r' \in \mathrm{Neigh}(r, \epsilon)$ is obtained by rotating $r$ by an angle
of $\epsilon/(\pi\sqrt{n})$ in some direction. The size of this neighbor set is clearly $2n - 1$.
We obtain the following theorem.
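The rotation neighborhood can be sketched directly from its definition. The code below is a minimal illustration in pure Python, assuming one particular way of completing $r$ to an orthonormal basis (Gram-Schmidt against the standard basis vectors); the text allows any such basis, and the helper names are hypothetical.

```python
import math


def orthonormal_basis_with_first(r):
    """Complete the unit vector r to an orthonormal basis whose first element is r.

    Uses Gram-Schmidt against the standard basis; this is one arbitrary choice.
    """
    n = len(r)
    basis = [list(r)]
    for i in range(n):
        v = [1.0 if j == i else 0.0 for j in range(n)]
        for b in basis:
            proj = sum(vj * bj for vj, bj in zip(v, b))
            v = [vj - proj * bj for vj, bj in zip(v, b)]
        norm = math.sqrt(sum(vj * vj for vj in v))
        if norm > 1e-9:  # skip directions already spanned by the basis
            basis.append([vj / norm for vj in v])
        if len(basis) == n:
            break
    return basis


def rotation_neighbors(r, eps):
    """Return r together with its 2(n - 1) rotations by angle eps / (pi * sqrt(n))."""
    n = len(r)
    angle = eps / (math.pi * math.sqrt(n))
    neighbors = [list(r)]
    for u in orthonormal_basis_with_first(r)[1:]:
        for sign in (1.0, -1.0):
            neighbors.append(
                [math.cos(angle) * ri + sign * math.sin(angle) * ui
                 for ri, ui in zip(r, u)]
            )
    return neighbors
```

Every neighbor other than $r$ itself is again a unit vector and has inner product $\cos(\epsilon/(\pi\sqrt{n}))$ with $r$, so each is a rotation of $r$ by exactly the stated angle.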

Theorem 15 Let $\mathcal{C}$ be the class of homogeneous linear separators, $\mathcal{R}$ be the
class of homogeneous linear separators represented by unit length normal vectors, and
$\mathcal{D}$ be an arbitrary spherically symmetric distribution. Define Neigh as in Equation 1
and let $p$ be any polynomial satisfying $p(n, 1/\epsilon) \ge 2n - 1$. Then $\mathcal{C}$ is
evolvable with drifting targets over $\mathcal{D}$ by algorithm
$\mathcal{A} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$ with

• any distributions $\mu$ that satisfy $\mu(r, r', \epsilon) \ge 1/p(n, 1/\epsilon)$ for all $r \in \mathcal{R}_n$, $\epsilon$, and $r' \in \mathrm{Neigh}(r, \epsilon)$,

• tolerance function t{r, e) = e/^ir^n) for all r G TZn, 

• any generation polynomial g(n, 1/e) > Sir^n/e, 

• a sample size s{n, 1/e) = 0(n^/e^), and 

• any drift polynomial d(n, 1/e) > Sir^n/e, which allows drift A < e/(87r^n). 



To prove this, we need only to show that Neigh is a strictly beneficial neighborhood function for 
C, T>, and TZ with b{n, 1/e) = 7r^n/(2e). The theorem then follows from Theorem El The analysis 
relies on the fact that under any spherically symmetric distribution $\mathcal{D}$ (for example,
the uniform distribution over a sphere), $\mathrm{err}_{\mathcal{D}}(u, v) = \arccos(u \cdot v)/\pi$,
where $\arccos(u \cdot v)$ is the angle between $u$ and $v$. This allows us to reason about the
performance of one function with respect to another by analyzing the dot product between their
normal vectors.
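The angle formula is easy to check numerically. The sketch below computes the disagreement $\arccos(u \cdot v)/\pi$ for unit vectors and compares it with a Monte Carlo estimate under a standard Gaussian, which is one spherically symmetric distribution. It also uses the standard identity that, for $\pm 1$-valued functions, performance (expected agreement) equals $1 - 2 \cdot \mathrm{err}$; the function names and the sample count are illustrative choices.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def disagreement(u, v):
    """err(u, v) = arccos(u . v) / pi for unit normal vectors u and v."""
    return math.acos(max(-1.0, min(1.0, dot(u, v)))) / math.pi


def performance(u, v):
    """Correlation of two +/-1-valued separators: 1 - 2 * err."""
    return 1.0 - 2.0 * disagreement(u, v)


def empirical_disagreement(u, v, samples=20000, seed=0):
    """Monte Carlo estimate of err(u, v) under a standard Gaussian."""
    rng = random.Random(seed)
    n = len(u)
    mismatches = 0
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        label_u = 1 if dot(u, x) >= 0 else -1
        label_v = 1 if dot(v, x) >= 0 else -1
        mismatches += label_u != label_v
    return mismatches / samples
```

For instance, orthogonal normal vectors disagree on half of the probability mass, giving a performance of zero.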

6.2 A Component- Wise Evolution Algorithm 

We now describe the alternate algorithm for evolving homogeneous linear separators. The guarantees
we achieve are inferior to those described in the previous section. However, this algorithm applies
when $\mathcal{D}$ is any unknown product normal distribution (with polynomial variance) over $\mathbb{R}^n$.

Let $r_i$ and $f_i$ denote the $i$th components of $r$ and $f$ respectively (not the values of the
representation and ideal function at round $i$ as in previous sections). The alternate algorithm
is based on the following observations. First, whenever there exists some $i$ for which $r_i$ and
$f_i$ have different signs and aren't too close to 0, we can obtain a new representation with a
non-trivial increase in performance by flipping the sign of $r_i$. Second, if there are no
beneficial sign flips, and if there is some $i$ for which $r_i$ is not too close to $f_i$, we can
obtain a new representation with a significant increase in performance by adjusting $r_i$ a little
and renormalizing. The amount we must adjust $r_i$ depends on the standard deviation of
$\mathcal{D}$ in the $i$th dimension, so we must try many values when $\mathcal{D}$ is unknown.
Finally, if the above conditions do not hold, then the performance of $r$ is already good enough.
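The two kinds of mutation described in these observations can be sketched as follows. This is only an illustration: `step` is a hypothetical shift amount, since the actual set of shift values tried by the algorithm is determined by its neighborhood definition and is not reproduced here.

```python
import math


def flip_sign(r, i):
    """Flip the sign of component i, i.e. r - 2 * r_i * e_i; the norm is unchanged."""
    flipped = list(r)
    flipped[i] = -flipped[i]
    return flipped


def shift_and_renormalize(r, i, step):
    """Adjust component i by a (hypothetical) amount `step`, then rescale to unit length."""
    shifted = list(r)
    shifted[i] += step
    norm = math.sqrt(sum(c * c for c in shifted))
    return [c / norm for c in shifted]
```

Both operations return unit vectors when given a unit vector (provided the shifted vector is nonzero), so the result stays inside the representation class.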

Denote by $\{e_i\}_{i=1}^{n}$ the basis of $\mathbb{R}^n$. Let $\sigma_1, \ldots, \sigma_n$ be the
standard deviations of the distribution $\mathcal{D}$ in the $n$ dimensions. We assume that
$1 \ge \sigma_i \ge (1/n)^k$ for some constant $k$ for all $i$, and that the algorithm is given
access to the value of $k$, but not the particular values $\sigma_i$. We define the
neighborhood function as Neigh(r, e) = N^ U N^i, where Na = {r — 2riei \ i = 1, . . . ,d} is the set of 
representations obtained by flipping the sign of one component of r, and the second set is the set
obtained by shifting each component by various amounts. We obtain the following.



† Technically we must assume that the representations $r \in \mathcal{R}_n$ and input points
$x \in \mathbb{R}^n$ are expressed to a fixed finite precision so that $r \cdot x$ is guaranteed to
be computable in polynomial time, but for simplicity, in the analysis that follows, we treat both
as simply vectors of real numbers.



Theorem 16 Let $\mathcal{C}$ be the class of homogeneous linear separators, $\mathcal{R}$ be the
class of homogeneous linear separators represented by unit length normal vectors, and $\mathcal{D}$
be a product normal distribution with (unknown) standard deviations $\sigma_1, \ldots, \sigma_n$
such that $1 \ge \sigma_i \ge (1/n)^k$ for all $i$ for a constant $k$.
Define Neigh as above and let p be any polynomial such that p{n, 1/e) > Sn^'^^^ + 2?!. Then C is 
evolvable with drifting targets over $\mathcal{D}$ by algorithm $\mathcal{A} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$ with

• any distribution $\mu$ satisfying $\mu(r, r', \epsilon) \ge 1/p(n, 1/\epsilon)$ for all $r \in \mathcal{R}_n$ and $r' \in \mathrm{Neigh}(r, \epsilon)$,

• tolerance function t{r, e) = e^/(288n), 

• any generation polynomial g{n, 1/e) > 2304n/e^, 

• a sample size s(n, 1/e) = 0(n^/e^^), and 

• any drift polynomial d{n, 1/e) > 2304n/e^, which allows drift A < e^/(2304ri). 

The proof formalizes the set of observations described above, using them to show that Neigh is 
a strictly beneficial neighborhood function for C, V, and TZ with h(n, 1/e) = 144n/e^. The theorem 
is then an immediate consequence of Theorem |8l 

7 Evolving Conjunctions with Drifting Targets 

We now show that conjunctions are evolvable with drifting targets over the uniform distribution 
with a drift of O(e^), independent of n. We begin by examining monotone conjunctions and prove 
that the neighborhood function defined by Valiant [l^ is a strictly beneficial neighborhood function 
with b(n, 1/e) = e^/9. Our proof uses techniques similar to those used in the simplified analysis of 
Valiant's algorithm presented by Diochnos and Turan Q- By building on ideas from Jacobson jl4 |. 
we extend this result to show that general conjunctions are evolvable with the same rate of drift. 

7.1 Monotone Conjunctions 

We represent monotone conjunctions using a representation class $\mathcal{R}$ where each
$r \in \mathcal{R}$ is a subset of $\{1, \ldots, n\}$ such that $|r| \le \log_2(3/\epsilon)$,
representing the conjunction of the variables $x_j$ for all $j \in r$. We therefore allow the
representation class to depend on $\epsilon$ in our analysis. This dependence is easy to remove
(e.g., using Valiant's technique of allowing an initial phase in which the length of the
representation decreases until it is below $\log_2(3/\epsilon)$ [19]), but simplifies presentation.

The neighborhood of a representation $r$ consists of the set of conjunctions that are formed by
adding a variable to $r$, removing a variable from $r$, and swapping a variable in $r$ with a
variable not in $r$, plus the representation $r$ itself. Formally, define the following three sets
of conjunctions: $N^+(r) = \{r \cup \{j\} \mid j \notin r\}$, $N^-(r) = \{r \setminus \{j\} \mid j \in r\}$,
and $N^\pm(r) = \{r \setminus \{j\} \cup \{k\} \mid j \in r, k \notin r\}$. The neighborhood
$\mathrm{Neigh}(r, \epsilon)$ is then defined as follows. Let $q = \lceil \log_2(3/\epsilon) \rceil$.
If $r$ is the empty set, then $\mathrm{Neigh}(r, \epsilon) = N^+(r) \cup r$. If $0 < |r| < q$, then
$\mathrm{Neigh}(r, \epsilon) = N^+(r) \cup N^-(r) \cup N^\pm(r) \cup r$. Finally, if $|r| = q$,
then $\mathrm{Neigh}(r, \epsilon) = N^-(r) \cup N^\pm(r) \cup r$. Note that the size of the
neighborhood is bounded by $1 + n + n^2/4$ in the worst case; the combined size of the sets
$N^+(r)$ and $N^-(r)$ is at most $n$, and the size of $N^\pm(r)$ is at most $n^2/4$. We obtain the
following theorem.
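The three mutation sets and the size bound can be checked with a short enumeration. The sketch below represents a monotone conjunction as a frozenset of variable indices, as in the text; the function name is illustrative.

```python
import math


def conjunction_neighborhood(r, n, eps):
    """Enumerate Neigh(r, eps) for a monotone conjunction r over variables 1..n."""
    r = frozenset(r)
    q = math.ceil(math.log2(3 / eps))  # maximum allowed conjunction length
    outside = [j for j in range(1, n + 1) if j not in r]
    add = {r | {j} for j in outside}                         # N^+(r)
    remove = {r - {j} for j in r}                            # N^-(r)
    swap = {(r - {j}) | {k} for j in r for k in outside}     # N^+-(r)
    if len(r) == 0:
        return add | {r}
    if len(r) < q:
        return add | remove | swap | {r}
    return remove | swap | {r}
```

With $n = 5$ and $\epsilon = 0.5$ (so $q = 3$), a conjunction of two variables has $3 + 2 + 6 + 1 = 12$ neighbors, which is within the bound $1 + n + n^2/4 = 12.25$.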

Theorem 17 Let $\mathcal{C}$ be the class of monotone conjunctions, $\mathcal{R}$ be the class of
monotone conjunctions of size at most $q = \lceil \log_2(3/\epsilon) \rceil$ represented as subsets
of indices, and $\mathcal{D}$ be the uniform distribution. Define Neigh as above and let $p$ be any
polynomial satisfying $p(n, 1/\epsilon) \ge 1 + n + n^2/4$. Then $\mathcal{C}$ is evolvable with
drifting targets over $\mathcal{D}$ by algorithm $\mathcal{A} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$ with

• any distributions $\mu$ that satisfy $\mu(r, r', \epsilon) \ge 1/p(n, 1/\epsilon)$ for all $r \in \mathcal{R}_n$, $\epsilon$, and $r' \in \mathrm{Neigh}(r, \epsilon)$,

• tolerance function t{r, e) = e^/18 for all r G TZn, 

• any generation polynomial g(n, 1/e) > 144/e^, 

• a sample size s{n, 1/e) = 0(l/e^), and 

• any drift polynomial d{n, 1/e) > 144/e^, which allows drift A < e^/144. 

To prove the theorem, we show that Neigh is a strictly beneficial target function with benefit 
polynomial b{n, 1/e) = 9/e^ and once again appeal to Theorem [S] The proof is then essentially 
just a case-by-case analysis of the performance of the best r' G Neigh(r, e) for an exhaustive set of 
conditions on r and /. 

7.2 General Conjunctions 

Jacobson [14] proposed an extension to the algorithm above that applies to general conjunctions.
The key innovation in his algorithm is the addition of a fourth set $N'(r)$ to the neighborhood of
$r$, where each $r' \in N'(r)$ is obtained by negating a subset of the literals in $r$. We show
here that the drift rate of his construction can be analyzed in a similar way to the monotone case.

We represent general conjunctions using a representation class $\mathcal{R}$ where each
$r \in \mathcal{R}$ is a subset of $\{1, \ldots, n\} \cup \{-1, \ldots, -n\}$ such that
$|r| \le \log_2(3/\epsilon)$. Here each $r$ represents the conjunction of literals $x_j$ for all
positive $j \in r$ and negated literals $\bar{x}_{-j}$ for all negative $j \in r$, and we restrict
$\mathcal{R}$ so that it is never the case that both $j \in r$ and $-j \in r$. The dependence of
this representation class on $\epsilon$ can be removed as before.

As before, the neighborhood of a representation $r$ includes the set of conjunctions that are
formed by adding a variable to $r$, removing a variable from $r$, and swapping a variable in $r$
with a variable not in $r$, plus the representation $r$ itself. However, it now also includes a
fourth set $N'(r)$ of all conjunctions that can be obtained by negating a subset of the literals of
$r$. The size of the set $N'(r)$ is at most $2^q < 6/\epsilon$, so by a similar argument to the
one above, the size of the neighborhood is bounded by $1 + 2n + n^2 + 6/\epsilon$. We obtain the
following theorem.

Theorem 18 Let $\mathcal{C}$ be the class of conjunctions, $\mathcal{R}$ be the class of
conjunctions of at most $q = \lceil \log_2(3/\epsilon) \rceil$ literals represented as above, and
$\mathcal{D}$ be the uniform distribution. Define Neigh as above and let $p$ be any polynomial
satisfying $p(n, 1/\epsilon) \ge 1 + 2n + n^2 + 6/\epsilon$. Then $\mathcal{C}$ is evolvable with
drifting targets over $\mathcal{D}$ by $\mathcal{A} = (\mathcal{R}, \mathrm{Neigh}, \mu, t, s)$
with $\mu$, $t$, $g$, $s$, and $d$ as specified in Theorem 17.

The proof uses many of the same ideas as the proof of Theorem 17. However, there are a few
extra cases that need to be considered. First, if $f$ is a "long" conjunction, and $r$ contains at
least one literal that is the negation of a literal in $f$, then we show that adding another
literal to $r$ leads to a significant increase in performance. (If $r$ is already of maximum size,
then the performance is already good enough.) Second, we show that if $f$ is "short" and $r$
contains at least one literal that is the negation of a literal in $f$, then there exists an
$r' \in N'(r)$ with significantly better performance. All other cases are identical to the
monotone case.

A Additional Proofs

A.l Accuracy of the Empirical Performance 

In order to prove Lemma 5 and Theorem 8, it is necessary to examine how close the empirical
performance of a representation $r$ is to the representation's true performance. The following
simple lemma shows that as long as the sample size $s(n, 1/\epsilon)$ is sufficiently large, the
empirical performance of each representation will be close to the true performance with high
probability.

Lemma 19 Consider any $r \in \mathcal{R}$ and $f \in \mathcal{C}$ and fix any $Z > 0$ and
$\delta > 0$. Let $N$ be an upper bound on the size of the neighborhood
$\mathrm{Neigh}(r, \epsilon)$. For each $r' \in \mathrm{Neigh}(r, \epsilon)$, let $v(r')$ be the
empirical performance of $r'$ with respect to $f$ on a sample of size
$s \ge 2\ln(2N/\delta)/Z^2$. With probability $1 - \delta$, for all
$r' \in \mathrm{Neigh}(r, \epsilon)$, $|v(r') - \mathrm{Perf}_f(r', \mathcal{D})| \le Z$.

Proof: Consider a particular $r' \in \mathrm{Neigh}(r, \epsilon)$. By Hoeffding's inequality [13],
for any $Z$, $\Pr(|v(r') - \mathrm{Perf}_f(r', \mathcal{D})| > Z) \le 2\exp(-sZ^2/2)$. The right
hand side of this inequality is upper bounded by $\delta/N$ as long as
$s \ge 2\ln(2N/\delta)/Z^2$, as we have assumed. The lemma then follows from a standard application
of the union bound. ■
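The sample-size requirement in Lemma 19 can be turned into a small calculator. The sketch below computes the smallest integer $s$ satisfying $s \ge 2\ln(2N/\delta)/Z^2$ and evaluates the resulting union bound $N \cdot 2\exp(-sZ^2/2)$; the function names are illustrative.

```python
import math


def hoeffding_sample_size(neighborhood_bound, delta, z):
    """Smallest integer s with s >= 2 * ln(2N / delta) / Z^2."""
    return math.ceil(2 * math.log(2 * neighborhood_bound / delta) / z ** 2)


def union_failure_bound(s, neighborhood_bound, z):
    """Union bound over N neighbors of the per-neighbor bound 2 * exp(-s * Z^2 / 2)."""
    return neighborhood_bound * 2 * math.exp(-s * z * z / 2)
```

For any inputs, `union_failure_bound(hoeffding_sample_size(N, delta, Z), N, Z)` is at most `delta`, which is exactly the guarantee the lemma states.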

A.2 Proof of Lemma 5

Suppose that Neigh is a strictly beneficial neighborhood function for $\mathcal{C}$, $\mathcal{D}$,
and $\mathcal{R}$ with benefit polynomial $b(n, 1/\epsilon)$. To prove the first half of the lemma,
we will construct an algorithm for strictly monotonically evolving $\mathcal{C}$ over $\mathcal{D}$.
First, for any $r \in \mathcal{R}_n$ and $\epsilon > 0$, we set the tolerance at
$t(r, \epsilon) = 1/(2b(n, 1/\epsilon))$. We then set
$s(n, 1/\epsilon) = 128(b(n, 1/\epsilon))^2 \ln(2p(n, 1/\epsilon)/\delta)$ for a choice of $\delta$
that will be specified below. By Lemma 19, this guarantees that on a particular round $i$, with
probability at least $1 - \delta$, for all $r \in \mathrm{Neigh}(r_{i-1}, \epsilon)$,
$|v(r) - \mathrm{Perf}_f(r, \mathcal{D})| \le 1/(8b(n, 1/\epsilon))$. For the remainder of this
proof, we refer to this high probability event as the HPE.

For any fixed round $i$, consider first the case that
$\mathrm{Perf}_f(r_{i-1}, \mathcal{D}) \ge 1 - \epsilon/2$. Since
$r_{i-1} \in \mathrm{Neigh}(r_{i-1}, \epsilon)$, there is always at least one neutral mutation
available and there could be a beneficial mutation, so $r_i$ will always be chosen from either Bene
or Neut. Consider an arbitrary $r_i$ chosen from Bene $\cup$ Neut. If the HPE occurs, then

\[ \mathrm{Perf}_f(r_{i-1}, \mathcal{D}) - \mathrm{Perf}_f(r_i, \mathcal{D}) \le \left( v(r_{i-1}) + \frac{1}{8b(n, 1/\epsilon)} \right) - \left( v(r_i) - \frac{1}{8b(n, 1/\epsilon)} \right) \le t(r, \epsilon) + \frac{1}{4b(n, 1/\epsilon)} = \frac{3}{4b(n, 1/\epsilon)} \le \frac{3\epsilon}{8}. \]

The last inequality uses the fact that $1/b(n, 1/\epsilon) \le \epsilon/2$. This must be the case
to guarantee that an improvement of $1/b(n, 1/\epsilon)$ is possible when the performance is
arbitrarily close to (but still less than) $1 - \epsilon/2$; otherwise, the definition of strictly
beneficial neighborhood would not be satisfied. We then have

\[ \mathrm{Perf}_f(r_i, \mathcal{D}) \ge 1 - \frac{\epsilon}{2} - \frac{3\epsilon}{8} > 1 - \epsilon. \qquad (2) \]

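Now consider the case in which $\mathrm{Perf}_f(r_{i-1}, \mathcal{D}) < 1 - \epsilon/2$. Since
Neigh is a strictly beneficial neighborhood function, it must be the case that there exists a
representation $r \in \mathrm{Neigh}(r_{i-1}, \epsilon)$ such that
$\mathrm{Perf}_f(r, \mathcal{D}) \ge \mathrm{Perf}_f(r_{i-1}, \mathcal{D}) + 1/b(n, 1/\epsilon)$.
Call this representation $r^*$. If the HPE occurs, then

\[ v(r^*) - v(r_{i-1}) \ge \left( \mathrm{Perf}_f(r^*, \mathcal{D}) - \frac{1}{8b(n, 1/\epsilon)} \right) - \left( \mathrm{Perf}_f(r_{i-1}, \mathcal{D}) + \frac{1}{8b(n, 1/\epsilon)} \right) = \left( \mathrm{Perf}_f(r^*, \mathcal{D}) - \mathrm{Perf}_f(r_{i-1}, \mathcal{D}) \right) - \frac{1}{4b(n, 1/\epsilon)} \ge \frac{3}{4b(n, 1/\epsilon)} > t(r, \epsilon), \]

and so $r^* \in$ Bene. Since the set Bene is non-empty, a representation in this set will be chosen
for $r_i$. Consider an arbitrary $r_i$ chosen from Bene. If the HPE occurs, then

\[ \mathrm{Perf}_f(r_i, \mathcal{D}) - \mathrm{Perf}_f(r_{i-1}, \mathcal{D}) \ge \left( v(r_i) - \frac{1}{8b(n, 1/\epsilon)} \right) - \left( v(r_{i-1}) + \frac{1}{8b(n, 1/\epsilon)} \right) = \left( v(r_i) - v(r_{i-1}) \right) - \frac{1}{4b(n, 1/\epsilon)} \ge t(r, \epsilon) - \frac{1}{4b(n, 1/\epsilon)} = \frac{1}{4b(n, 1/\epsilon)}. \qquad (3) \]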



Now, let $g(n, 1/\epsilon) = 8b(n, 1/\epsilon)$. Setting the parameter
$\delta = \epsilon/g(n, 1/\epsilon)$ above and applying the union bound again, we have that with
probability at least $1 - \epsilon$, the HPE occurs at all rounds
$i \in \{1, 2, \ldots, g(n, 1/\epsilon)\}$. Suppose this is the case. From the argument leading up
to Equation 3, we know that the performance of the current representation is monotonically
increasing as long as the performance is less than $1 - \epsilon/2$, and furthermore increases by
at least $1/(4b(n, 1/\epsilon))$ on each round. It remains to show that the algorithm evolves
$\mathcal{C}$, that is, that $\mathrm{Perf}_f(r_{g(n, 1/\epsilon)}, \mathcal{D}) \ge 1 - \epsilon$.

From the monotonic improvement when the performance is less than 1 — e/2 and the argument 
leading up to Equation [5J it is clear that if the performance ever reaches 1 — e/2, it will not fall 
below 1 — e again before round g{n, 1/e). It is easy to see that the performance reaches 1 — e/2 at 
some point during these g(n,l/e) rounds. In the worst case, the performance starts at —1. It is 
guaranteed to increase by at least l/(46(?i, 1/e)) on each round. Thus it must reach 1 — e/2 in no 
more than g{n, 1/e) = 8b{n, 1/e) rounds. 

To prove the second half of the lemma, note that the definition of strictly monotonic evolvability requires that for any initial representation $r_0 \in \mathcal{R}$, with high probability either $\mathrm{Perf}_f(r_0,\mathcal{D}) \ge 1-\epsilon/2$ or $\mathrm{Perf}_f(r_1,\mathcal{D}) \ge \mathrm{Perf}_f(r_0,\mathcal{D}) + 1/m(n,2/\epsilon)$. This implies that if $\mathrm{Perf}_f(r_0,\mathcal{D}) < 1-\epsilon/2$ there must exist an $r_1 \in$ Neigh$(r_0,\epsilon)$ such that $\mathrm{Perf}_f(r_1,\mathcal{D}) \ge \mathrm{Perf}_f(r_0,\mathcal{D}) + 1/m(n,2/\epsilon)$. ■

A.3 Proof of Theorem 8

Suppose that Neigh is a strictly beneficial neighborhood function for $C$, $\mathcal{D}$, and $\mathcal{R}$ with benefit polynomial $b(n,1/\epsilon)$. For any $r \in \mathcal{R}_n$ and $\epsilon > 0$, we set the tolerance at $t(r,\epsilon) = 1/(2b(n,1/\epsilon))$. We then set $s(n,1/\epsilon) = 128(b(n,1/\epsilon))^2 \ln(2p(n,1/\epsilon)/\delta)$ for a choice of $\delta$ that will be specified below. This guarantees that on a particular round $i$, with probability at least $1-\delta$, for all $r \in$ Neigh$(r_{i-1},\epsilon)$, $|v(r) - \mathrm{Perf}_{f_{i-1}}(r,\mathcal{D})| \le 1/(8b(n,1/\epsilon))$. (See Lemma 19 in Appendix A.1 for details.) For the remainder of this proof, we refer to this high probability event as the HPE.
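The sample size above can be sanity-checked numerically. The following is a minimal sketch, assuming the empirical performance $v(r)$ is an average of $s$ independent values in $[-1,1]$, so that the standard two-sided Hoeffding bound applies, followed by a union bound over the $p$ neighbors. The function names are hypothetical and chosen only for this illustration.

```python
import math


def sample_size(b: float, p: float, delta: float) -> int:
    """s = 128 * b^2 * ln(2p / delta), rounded up to an integer."""
    return math.ceil(128.0 * b * b * math.log(2.0 * p / delta))


def hoeffding_failure_bound(s: int, tau: float) -> float:
    """Hoeffding bound for the mean of s samples taking values in [-1, 1]:
    Pr[|mean - expectation| > tau] <= 2 * exp(-s * tau^2 / 2)."""
    return 2.0 * math.exp(-s * tau * tau / 2.0)


def hpe_failure_bound(b: float, p: float, delta: float) -> float:
    """Union bound over p neighbors, each estimated to accuracy 1 / (8b)."""
    s = sample_size(b, p, delta)
    return p * hoeffding_failure_bound(s, 1.0 / (8.0 * b))


if __name__ == "__main__":
    # Illustrative values only: b = 10, p = 50 neighbors, delta = 0.01.
    print(sample_size(10, 50, 0.01), hpe_failure_bound(10, 50, 0.01))
```

With $\tau = 1/(8b)$ the exponent equals $-s/(128b^2) = -\ln(2p/\delta)$, so each neighbor fails with probability at most $\delta/p$, which is where the constant 128 comes from under this assumption.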

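Fix an $i$. Suppose $\Delta \le 1/(16b(n,1/\epsilon))$. If $f_1, f_2, \dots$ is a $\Delta$-drifting sequence, then for any $r \in \mathcal{R}$,

$$\left|\mathrm{Perf}_{f_{i-1}}(r,\mathcal{D}) - \mathrm{Perf}_{f_i}(r,\mathcal{D})\right| \le \mathbb{E}_{x\sim\mathcal{D}}\left[|f_{i-1}(x) - f_i(x)|\cdot|r(x)|\right] \le 2\,\mathrm{err}_{\mathcal{D}}(f_{i-1},f_i) \le 2\Delta \le \frac{1}{8b(n,1/\epsilon)}. \qquad (4)$$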

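Consider the case that $\mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) \ge 1-\epsilon/2$. Since $r_{i-1} \in$ Neigh$(r_{i-1},\epsilon)$, there is at least one neutral mutation available and there could be a beneficial mutation, so $r_i$ will be chosen from either Bene or Neut. Consider an arbitrary $r_i$ chosen from Bene $\cup$ Neut. If the HPE occurs, then

$$\begin{aligned}
\mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) - \mathrm{Perf}_{f_{i-1}}(r_i,\mathcal{D}) &\le \left(v(r_{i-1}) + \frac{1}{8b(n,1/\epsilon)}\right) - \left(v(r_i) - \frac{1}{8b(n,1/\epsilon)}\right) \\
&\le t(r,\epsilon) + \frac{1}{4b(n,1/\epsilon)} = \frac{3}{4b(n,1/\epsilon)}.
\end{aligned}$$

Then from Equation 4 and the assumption that $\Delta \le 1/(16b(n,1/\epsilon))$,

$$\begin{aligned}
\mathrm{Perf}_{f_i}(r_i,\mathcal{D}) &\ge \mathrm{Perf}_{f_{i-1}}(r_i,\mathcal{D}) - \frac{1}{8b(n,1/\epsilon)} \\
&\ge \mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) - \frac{3}{4b(n,1/\epsilon)} - \frac{1}{8b(n,1/\epsilon)} \\
&\ge 1 - \frac{\epsilon}{2} - \frac{7}{8b(n,1/\epsilon)} \ge 1 - \frac{\epsilon}{2} - \frac{7\epsilon}{16} > 1-\epsilon. \qquad (5)
\end{aligned}$$

The last line uses the fact that $1/b(n,1/\epsilon) \le \epsilon/2$. This must be the case to guarantee that an improvement of $1/b(n,1/\epsilon)$ is possible when the performance is arbitrarily close to (but still less than) $1-\epsilon/2$; otherwise, the definition of strictly beneficial neighborhood would not be satisfied.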

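Now consider the case in which $\mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) < 1-\epsilon/2$. Since Neigh is a strictly beneficial neighborhood function, it must be the case that there exists a representation $r \in$ Neigh$(r_{i-1},\epsilon)$ such that $\mathrm{Perf}_{f_{i-1}}(r,\mathcal{D}) \ge \mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) + 1/b(n,1/\epsilon)$. Call this $r^*$. If the HPE occurs, then

$$\begin{aligned}
v(r^*) - v(r_{i-1}) &\ge \left(\mathrm{Perf}_{f_{i-1}}(r^*,\mathcal{D}) - \frac{1}{8b(n,1/\epsilon)}\right) - \left(\mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) + \frac{1}{8b(n,1/\epsilon)}\right) \\
&= \left(\mathrm{Perf}_{f_{i-1}}(r^*,\mathcal{D}) - \mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D})\right) - \frac{1}{4b(n,1/\epsilon)} \ge \frac{3}{4b(n,1/\epsilon)} > t(r,\epsilon),
\end{aligned}$$

and so $r^* \in$ Bene. Since the set Bene is non-empty, a representation in this set will be chosen for $r_i$. Consider an arbitrary $r_i$ chosen from Bene. If the HPE occurs, then

$$\begin{aligned}
\mathrm{Perf}_{f_{i-1}}(r_i,\mathcal{D}) - \mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) &\ge \left(v(r_i) - \frac{1}{8b(n,1/\epsilon)}\right) - \left(v(r_{i-1}) + \frac{1}{8b(n,1/\epsilon)}\right) \\
&\ge t(r,\epsilon) - \frac{1}{4b(n,1/\epsilon)} = \frac{1}{4b(n,1/\epsilon)}.
\end{aligned}$$

Then from Equation 4,

$$\begin{aligned}
\mathrm{Perf}_{f_i}(r_i,\mathcal{D}) - \mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) &\ge \mathrm{Perf}_{f_{i-1}}(r_i,\mathcal{D}) - \mathrm{Perf}_{f_{i-1}}(r_{i-1},\mathcal{D}) - 2\Delta \\
&\ge \frac{1}{4b(n,1/\epsilon)} - \frac{1}{8b(n,1/\epsilon)} = \frac{1}{8b(n,1/\epsilon)}. \qquad (6)
\end{aligned}$$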

Now, let $g(n,1/\epsilon) = 16b(n,1/\epsilon)$ and consider any round $\ell \ge g(n,1/\epsilon)$. Setting the parameter $\delta = \epsilon/g(n,1/\epsilon)$ above and applying the union bound again, we have that with probability at least $1-\epsilon$, the HPE occurs at all rounds $i \in \{\ell - g(n,1/\epsilon), \dots, \ell-1\}$. Suppose this is the case.

From the argument leading up to Equation 6 we know that the performance of the current representation with respect to the current target is monotonically increasing as long as the performance is less than $1-\epsilon/2$, and increases by at least $1/(8b(n,1/\epsilon))$ on each round. Combining this with the argument leading up to Equation 5, it is clear that if the performance ever reaches $1-\epsilon/2$ during this period of time, it will never again fall below $1-\epsilon$ before round $\ell$. It remains to show that the performance reaches $1-\epsilon/2$ at some point during these $g(n,1/\epsilon)$ rounds. This is also easy to see. In the worst case, the performance starts at $-1$. It is guaranteed to increase by at least $1/(8b(n,1/\epsilon))$ on each round, so it must reach $1-\epsilon/2$ in no more than $g(n,1/\epsilon) = 16b(n,1/\epsilon)$ rounds.

This shows that for any $\ell \ge g(n,1/\epsilon)$, with probability at least $1-\epsilon$, $\mathrm{Perf}_{f_\ell}(r_\ell,\mathcal{D}) \ge 1-\epsilon$ and so $C$ is evolvable with drifting targets. ■
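The bookkeeping in this proof can be replayed with exact rational arithmetic. The sketch below assumes only the parameter choices stated in the proof (tolerance $1/(2b)$, estimation accuracy $1/(8b)$, drift $\Delta \le 1/(16b)$, and $1/b \le \epsilon/2$); the function names are hypothetical and serve only this illustration.

```python
import math
from fractions import Fraction


def proof_parameters(b: Fraction) -> dict:
    """Parameter choices of the proof, as exact fractions of the benefit b."""
    return {
        "tolerance": 1 / (2 * b),       # t(r, eps)
        "estimate_error": 1 / (8 * b),  # accuracy guaranteed by the HPE
        "max_drift": 1 / (16 * b),      # largest drift rate Delta considered
    }


def guaranteed_gain(b: Fraction) -> Fraction:
    """Lower bound on the per-round gain while performance is below 1 - eps/2."""
    p = proof_parameters(b)
    gain_on_old_target = p["tolerance"] - 2 * p["estimate_error"]
    return gain_on_old_target - 2 * p["max_drift"]  # target moved by <= 2*Delta


def worst_case_slip(b: Fraction) -> Fraction:
    """Upper bound on the per-round loss once performance is at least 1 - eps/2."""
    p = proof_parameters(b)
    return p["tolerance"] + 2 * p["estimate_error"] + 2 * p["max_drift"]


def rounds_needed(b: Fraction, eps: Fraction) -> int:
    """Rounds to climb from performance -1 to 1 - eps/2 at the guaranteed gain."""
    distance = (1 - eps / 2) - (-1)
    return math.ceil(distance / guaranteed_gain(b))


if __name__ == "__main__":
    b, eps = Fraction(20), Fraction(1, 10)  # satisfies 1/b <= eps/2
    print(guaranteed_gain(b), worst_case_slip(b), rounds_needed(b, eps))
```

The gain evaluates to $1/(8b)$, the slip to $7/(8b) < \epsilon/2$, and the number of rounds stays below $16b$, matching the three quantities the argument relies on.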

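A.4 Proof of Lemma 10

Assume that $i < j$. The proof for the case in which $i > j$ is nearly identical, and the result is trivial if $i = j$. For any $\tau$ and any function $\phi : X \to [-1,1]$,

$$\begin{aligned}
\left|\mathbb{E}[\phi(x)f_i(x)] - \mathbb{E}[\phi(x)f_j(x)]\right| &= \left|\mathbb{E}\left[\phi(x)(f_i(x) - f_j(x))\right]\right| = \left|\mathbb{E}\left[\phi(x)\sum_{k=1}^{j-i}\left(f_{i+k-1}(x) - f_{i+k}(x)\right)\right]\right| \\
&\le \mathbb{E}\left[|\phi(x)|\sum_{k=1}^{j-i}\left|f_{i+k-1}(x) - f_{i+k}(x)\right|\right] \\
&\le \sum_{k=1}^{j-i}\mathbb{E}\left[\left|f_{i+k-1}(x) - f_{i+k}(x)\right|\right] \le (j-i)\Delta \le \frac{\tau}{2},
\end{aligned}$$

where all expectations are taken with respect to $x \sim \mathcal{D}$. Therefore if $\mathbb{E}_{x\sim\mathcal{D}}[\phi(x)f_j(x)] \ge \theta + \tau$, then

$$\mathbb{E}_{x\sim\mathcal{D}}[\phi(x)f_i(x)] \ge \mathbb{E}_{x\sim\mathcal{D}}[\phi(x)f_j(x)] - \frac{\tau}{2} \ge \theta + \tau - \frac{\tau}{2} = \theta + \frac{\tau}{2}.$$

Similarly, if $\mathbb{E}_{x\sim\mathcal{D}}[\phi(x)f_j(x)] \le \theta - \tau$, then

$$\mathbb{E}_{x\sim\mathcal{D}}[\phi(x)f_i(x)] \le \mathbb{E}_{x\sim\mathcal{D}}[\phi(x)f_j(x)] + \frac{\tau}{2} \le \theta - \tau + \frac{\tau}{2} = \theta - \frac{\tau}{2}.$$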

A.5 Proof of Lemma 11

Under the assumption that the LPE does not occur at any time step, after $q$ time steps if $r_q = r_\epsilon[h,z]$, then $|z| = q$, since we add one bit at each step. Let $r_i = r_\epsilon[h,z^i]$. We consider the possible mutations of $r_i$. Observe that $r_\epsilon[h,z^i0]$ is always neutral for all $i$. The cases we need to consider are (a) $r_\epsilon[h,z^i1]$ is beneficial, and therefore chosen as the next representation, implying that $z_{i+1} = 1$, (b) $r_\epsilon[h,z^i1]$ is deleterious, and therefore $r_\epsilon[h,z^i0]$ is chosen as the next representation, implying that $z_{i+1} = 0$, and (c) $r_\epsilon[h,z^i1]$ is neutral, which implies that either $r_\epsilon[h,z^i1]$ or $r_\epsilon[h,z^i0]$ can be chosen as the next representation, implying that $z_{i+1} = 0$ or $1$.

Suppose we are in case (a). We show that 1 is then a valid answer to the query $(\phi_{z^i,\epsilon}, \theta_{z^i,\epsilon}, \tau/2)$ with respect to $f_i$. Consider,



^js' ,e; ^ z'- .e^ 



t\n,-] <t;(r,[ft,zn])-u(r,) <Perf/,(re[ft,zn],P)-Perf/,(r„23) + 2T' = ^E[, 



r/2) 

..,e-/.]+2T' . 



Re-arranging the terms, we get: 



e 



1 



2r' > 



^ z^ ,e 



r 
ej J - 2 ' 

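Similarly one can show in case (b) that $\mathbb{E}[\phi_{z^i,\epsilon}\cdot f_i] \le \theta_{z^i,\epsilon} - \tau/2$ and hence 0 is a valid answer to the query $(\phi_{z^i,\epsilon}, \theta_{z^i,\epsilon}, \tau/2)$, and in case (c) that $\theta_{z^i,\epsilon} - \tau/2 \le \mathbb{E}[\phi_{z^i,\epsilon}\cdot f_i] \le \theta_{z^i,\epsilon} + \tau/2$, and hence both 0 and 1 are valid answers.

By Lemma 10, if $z_{i+1}$ is a valid answer to the query $(\phi_{z^i,\epsilon}, \theta_{z^i,\epsilon}, \tau/2)$ with respect to $f_i$, it is a valid answer to the query $(\phi_{z^i,\epsilon}, \theta_{z^i,\epsilon}, \tau)$ with respect to $f_q$ (since $\Delta \le \tau/(2q)$). Thus $z$ is consistent with $f_q$. ■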



A.6 Proof of Lemma 12

Let $\tau' = t_u(n,1/\epsilon)/8$. Assuming that the LPE does not occur at any time step, $w_{j+1}$ is always a neutral mutation for $w_j$, and mutations of the form $w_j \to w_j$ will not occur. Also $r_\epsilon[h_{z,\epsilon},\sigma]$ will always be chosen if it is a neutral mutation. Then in $K-k$ rounds we will reach $w_K$ (if we had not already gone to $r_\epsilon[h_{z,\epsilon},\sigma]$) and hence on the next round we will move to $r_\epsilon[0,\sigma]$. This implies that the number of steps is at most $K-k+1$.

Now, suppose that at some stage $\mathrm{Perf}_{f_i}(r_i,\mathcal{D}) < \mathrm{Perf}_{f_i}(r_\epsilon[h_{z,\epsilon},\sigma],\mathcal{D})$. Then

Perf /._, ir,[hz^„ a],V) - Perff^_, (r,_i, P) 

> Pert f^{r^[hz^,,a],V)- Pert f^_^{r,,V) ^ A> ^-^ - A, 

and so $r_\epsilon[h_{z,\epsilon},\sigma]$ is a neutral mutation for $r_{i-1}$. By the assumption above, $r_i = r_\epsilon[h_{z,\epsilon},\sigma]$, proving the lemma. ■

A.7 Proof Sketch for Theorem 14

We apply pieces of the analysis of Theorem 13 here. We omit some details since the arguments are very similar; in fact, the argument that this algorithm is resistant to drift is nearly identical. To start, we let the representation class be $\mathcal{R}_\epsilon$, which depends on $\epsilon$. Here, backsliding is not allowed since it degrades performance arbitrarily. We discuss how to encode all values of $\epsilon$ in the same representation class in Section A.7.1 below.

We will use the same construction as defined in Section 5.3 with only a small modification. For the representations in $W_\epsilon$, say $w_0 = r_\epsilon[h,z]$ with $|z| = q$, in the neighborhood of $w_k$ we will also add $r_\epsilon[h,\sigma]$ in addition to the existing $r_\epsilon[h_z,\sigma]$, $r_\epsilon[0,\sigma]$ and $w_{k+1}$. Thus, even if $h_z$ has poor performance, we can ensure that the performance never goes more than $\epsilon$ lower than the starting state. Formally,

• Neigh'$(w_K,\epsilon) = \{w_K, r_\epsilon[0,\sigma]\}$ and Neigh'$(w_k,\epsilon) = \{w_k, w_{k+1}, r_\epsilon[h_z,\sigma], r_\epsilon[h,\sigma]\}$ for $k < K$,

• $\mu'(w_K,w_K) = \eta$ and $\mu'(w_K,r_\epsilon[0,\sigma]) = 1-\eta$, and $\mu'(w_k,w_k) = \eta^2$, $\mu'(w_k,w_{k+1}) = \mu'(w_k,r_\epsilon[h_z,\sigma]) = (\eta-\eta^2)/2$, and $\mu'(w_k,r_\epsilon[h,\sigma]) = 1-\eta$ for $k < K$, and

• $t'(w_k,\epsilon) = t_u(n,1/\epsilon)$ for all $k$.

Let $\eta = \epsilon/(4q+2K)$, $\tau' = \min\{(\epsilon\tau)/(2q),\ t_u(n,1/\epsilon)/8\}$, and $s = 1/(2(\tau')^2)\log((6q+3K)/\epsilon)$, as defined earlier in Section 5.3. Let Neigh' = Neigh, $\mu' = \mu$ and $t' = t$ for the representations in $\mathcal{R}_\epsilon$. We will show that $\mathcal{E} = (\mathcal{R}_\epsilon, \text{Neigh}', \mu', t', s)$ evolves $C$ quasi-monotonically.

The intuition of the proof is as follows. Any two representations $r_\epsilon[h,z]$ and $r_\epsilon[h,z']$ are within performance $\epsilon$ of each other (by definition). Using a similar argument as that for Lemma 12, one can show that while we start decreasing performance from $w_0 = r_\epsilon[h,z]$, the performance never dips below the performance of $r_\epsilon[h,\sigma]$ (and since this has the highest probability, this will be chosen whenever it is neutral). If $r_\epsilon[h_z,\sigma]$ is chosen earlier, it will be because its performance was higher than that of $r_\epsilon[h,\sigma]$, and quasi-monotonicity is maintained. Just as in Theorem 13, one can show that in at most $2q+K+1$ steps, a representation with performance $1-\epsilon$ is reached.

For our setting of parameters, the LPE does not occur for $g$ time steps, and we will assume that this is the case. Recall that $\Delta = (\epsilon\tau)/(4q+2K+2)$.

There are two distinct types of starting representations: (i) $r_\epsilon[h,z]$ with $|z| < q$, or (ii) $w_k$ for some $k$ where $w_0 = r_\epsilon[h,z]$ with $|z| = q$. Suppose first that the starting representation is $r_\epsilon[h,z]$. Since LPE events don't occur, we will reach $r_\epsilon[h,z^*]$ with $|z^*| = q$ in $q - |z|$ steps. Note that for all $z'$ and any $f$, $|\mathrm{Perf}_f(r_\epsilon[h,z'],\mathcal{D}) - \mathrm{Perf}_f(r_\epsilon[h,z],\mathcal{D})| \le \epsilon$. So during this phase quasi-monotonicity is maintained.

Consider the case in which the starting representation is instead $w_k$ for some $k$, with $w_0 = r_\epsilon[h,z^*]$ and $|z^*| = q$, or the case in which we reach such a representation $w_k$ after starting at $r_\epsilon[h,z]$. The algorithm then transitions to either representation $r_\epsilon[h,\sigma]$ or representation $r_\epsilon[h_{z^*},\sigma]$. Furthermore, the transition to $r_\epsilon[h_{z^*},\sigma]$ happens only if $\mathrm{Perf}_f(h_{z^*},\mathcal{D}) \ge \mathrm{Perf}_f(h,\mathcal{D})$. This happens in at most $K$ steps (using an argument similar to that of Lemma 12), and during this time the performance never goes below that of $r_\epsilon[h,\sigma]$, and so never goes more than $\epsilon/2$ below that of the starting representation.

Let $h'$ be either $h$ or $h_{z^*}$ depending on which was chosen as described in the above paragraph. Since $\Delta$ is low, the performance of $h'$ will never go significantly below that of $h$ (even if $h' = h_{z^*}$) for the next $g$ steps, hence it is sufficient to prove that the performance will not drop significantly below that of $r_\epsilon[h',\sigma]$. From $r_\epsilon[h',\sigma]$ in at most $q$ steps we reach a representation $r_\epsilon[h',z']$ where $z'$ is consistent with $f$. During this time the performance never goes more than $\epsilon/2$ below $\mathrm{Perf}_f(r_\epsilon[h',\sigma],\mathcal{D})$. From $r_\epsilon[h',z']$ we reach a representation with performance greater than $1-\epsilon$ in one step, or $r_\epsilon[h',z']$ already has performance at least $1-\epsilon$. Thus, the evolution is quasi-monotonic.



A.7.1 Removing the Need to Know $\epsilon$

In Section A.7 we showed that any CSQ algorithm can be converted into an evolutionary algorithm that is drift-resistant and quasi-monotonic, provided we are allowed to fix $\epsilon$ and encode its value in the representation. Here, we describe in some detail how a representation class that simultaneously encodes all values of $\epsilon$ can be constructed. Note that the definition of evolvability allows the neighborhood to depend on $\epsilon$, but not the starting representation.

We assume that the parameter $\epsilon$ provided to the algorithm is a power of 2. If this were not the case we could simply run the algorithm with $\epsilon'$, setting $\epsilon' = 2^{\lfloor \log_2 \epsilon \rfloor}$. The performance guarantees would only be better since $\epsilon' \le \epsilon$. Furthermore, since $\epsilon' > \epsilon/2$, the running time would not be affected, except up to a constant factor. The representations will encode values of $\epsilon$ ranging over the set $S_\epsilon = \{1/2, 1/4, \dots, 2^{-n}\}$. It is not necessary to consider values of $\epsilon$ smaller than this, since this would allow the algorithm to take time exponential in $n$, and hence an exhaustive search over all functions of polynomial-sized representations would be feasible in just one round of evolution. For the rest of this section, assume that $\epsilon$ can only take values from this set.
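The rounding step can be illustrated with a short sketch. It assumes $0 < \epsilon < 1$ and uses the binary exponent of the floating-point value rather than a logarithm, which avoids rounding problems at exact powers of two. The function names are hypothetical, and the lower cutoff $2^{-n}$ is not enforced by the rounding function itself.

```python
import math


def round_down_to_power_of_two(eps: float) -> float:
    """Return eps' = 2**floor(log2(eps)) for 0 < eps < 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    # frexp gives eps = mantissa * 2**exponent with mantissa in [0.5, 1),
    # so 2**(exponent - 1) <= eps < 2**exponent.
    _mantissa, exponent = math.frexp(eps)
    return math.ldexp(1.0, exponent - 1)


def encodable_values(n: int) -> list:
    """The set {1/2, 1/4, ..., 2**-n} of accuracy values a term may encode."""
    return [math.ldexp(1.0, -i) for i in range(1, n + 1)]


if __name__ == "__main__":
    for eps in (0.9, 0.3, 0.25, 0.07):
        print(eps, round_down_to_power_of_two(eps))
```

In every case the rounded value satisfies $\epsilon/2 < \epsilon' \le \epsilon$, so running with $\epsilon'$ costs at most a constant factor.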

Recall the notation used in Section 5. In particular, $\mathcal{A}$ is a CSQ$_>$ algorithm that takes parameter $\epsilon$, makes $q = q(n,1/\epsilon)$ queries of tolerance $\tau(n,1/\epsilon)$, and returns a hypothesis $h$ with $\mathrm{Perf}_f(h,\mathcal{D}) \ge 1-\epsilon$. Similar to the definitions in Section 5.3, let $\Phi_{z,\epsilon} = (1/q)\sum_{i=1}^{q} I(z_i = 1)\,\phi_{z^{i-1},\epsilon}(x)$.

Define a term as follows:

• Every $h \in \mathcal{H}$ is a term, and $h$ is said to encode no $\epsilon$.

• For any $\epsilon_1$, let $T_1$ be a term that either encodes no $\epsilon$ or encodes only $\epsilon' > \epsilon_1$. Then $T = (1-\epsilon_1/2)T_1 + (\epsilon_1/2)\Phi_{z,\epsilon_1}$ is a term if $|z| \le q(n,1/\epsilon_1)$. Furthermore, $T$ is said to encode all of the values of $\epsilon$ that $T_1$ encodes plus $\epsilon_1$.

Thus any term $T$ may encode up to $n$ values of $\epsilon$, and the values of $\epsilon$ will increase as we get deeper in the term. This ensures that all terms have polynomial-sized (in $n$) representations, and the number of terms is finite.

Let list$(T)$ denote the list of all $\epsilon \in S_\epsilon$ that are encoded in $T$. Observe that the definition of term implies that the smallest $\epsilon \in$ list$(T)$ is encoded at the outermost level, and the values increase as we move to the interior. In particular if $\epsilon_1$ is the smallest value in list$(T)$, then $T = (1-\epsilon_1/2)T_1 + (\epsilon_1/2)\Phi_{z,\epsilon_1}$ for some $z$ and $T_1$, and list$(T_1)$ = list$(T) \setminus \{\epsilon_1\}$. Denote by out$(T)$ the smallest value of $\epsilon$ in list$(T)$ and let next$(T)$ be $T_1$ such that $T = (1-\text{out}(T)/2)T_1 + (\text{out}(T)/2)\Phi_{z,\text{out}(T)}$.

We consider all terms except those of the form $h \in \mathcal{H}$ to be valid representations. The representation class will also contain more representations, that we shall define shortly.

Consider the following three cases.

(a) The evolutionary algorithm is in a state $T$ such that out$(T) = \epsilon$, where $\epsilon$ is the true parameter of the algorithm. Then (pretending as if $T$ is in $\mathcal{H}$) the results from Section 5 will apply directly. In particular, let $T = r_\epsilon[T_1,z] = (1-\epsilon/2)T_1 + (\epsilon/2)\Phi_{z,\epsilon}$. Then Neigh$(T,\epsilon) = \{T, r_\epsilon[T_1,z0], r_\epsilon[T_1,z1]\}$ if $|z| < q(n,1/\epsilon)$. When $|z| = q(n,1/\epsilon)$, we again define states similar to those in $W_\epsilon$ in Section 5.3, which allow the algorithm to gradually slide to $r_\epsilon[h_z,\sigma]$, $r_\epsilon[T_1,\sigma]$ or $r_\epsilon[0,\sigma]$ (but the performance will never go more than $\epsilon$ lower than $T$ with high probability).

(b) The case when out$(T) < \epsilon$. Let $T_1 = T$, and define $T_i$ = next$(T_{i-1})$ for all $i > 1$; let $k$ be the smallest index such that out$(T_k) \ge \epsilon$. (It may happen that $T_k = h$ for some $h \in \mathcal{H}$.) Note that

$$T_1 = \left(1-\frac{\epsilon_1}{2}\right)\left(\left(1-\frac{\epsilon_2}{2}\right)\left(\cdots\left(\left(1-\frac{\epsilon_{k-1}}{2}\right)T_k + \frac{\epsilon_{k-1}}{2}\Phi_{z_{k-1},\epsilon_{k-1}}\right)\cdots\right) + \frac{\epsilon_2}{2}\Phi_{z_2,\epsilon_2}\right) + \frac{\epsilon_1}{2}\Phi_{z_1,\epsilon_1}.$$

Then since $\epsilon_1 < \epsilon_2 < \cdots < \epsilon_{k-1} \le \epsilon/2$, and every $\epsilon_i$ is a power of 2,

$$\mathrm{Perf}_f(T_k,\mathcal{D}) \ge \mathrm{Perf}_f(T_1,\mathcal{D}) - 2(\epsilon_1 + \cdots + \epsilon_{k-1}) \ge \mathrm{Perf}_f(T_1,\mathcal{D}) - 4(\epsilon/2).$$

Then $\mathrm{Perf}_f(T_k,\mathcal{D}) \ge \mathrm{Perf}_f(T,\mathcal{D}) - 2\epsilon$ (because $\epsilon_{k-1} \le \epsilon/2$). If out$(T_k) = \epsilon$, let $r_b = T_k$. Otherwise, let $r_b = (1-\epsilon/2)T_k + (\epsilon/2)\Phi_{\sigma,\epsilon}$. Note that $r_b$ is always a valid term (by the above definition) and hence it is in the representation class. Also $\mathrm{Perf}_f(r_b,\mathcal{D}) \ge \mathrm{Perf}_f(T,\mathcal{D}) - 3\epsilon$.

(c) The case when out$(T) > \epsilon$. Let $r_c = (1-\epsilon/2)T + (\epsilon/2)\Phi_{\sigma,\epsilon}$. Again, $r_c$ is a valid term and hence in the representation class. Also in this case $\mathrm{Perf}_f(r_c,\mathcal{D}) \ge \mathrm{Perf}_f(T,\mathcal{D}) - \epsilon$.

In cases (b) and (c), if we can transition to the representations $r_b$ and $r_c$ respectively, we will have reduced to case (a). However, since the moves themselves may be deleterious, we need to add intermediate representations similar to those defined in $W_\epsilon$ in Section 5.3. In particular let $w_0 = T$ (where $T$ may be that of case (b) or (c)). Define $w_k = (1 - k(t_u(n,1/\epsilon)/2))w_0$, where $t_u(n,1/\epsilon)$ is the polynomial upper bound on the tolerances. Define Neigh$(w_k,\epsilon) = \{w_k, w_{k+1}, r_b\}$ (or Neigh$(w_k,\epsilon) = \{w_k, w_{k+1}, r_c\}$). The idea is the same: the performance reduces gradually until the jump to $r_b$ (or $r_c$) is no longer deleterious. During this time the performance never goes below that of $r_b$ (respectively $r_c$) and hence quasi-monotonicity is maintained. (Although in some cases degradation may be $3\epsilon$, we could just run with higher accuracy (say $\epsilon/4$) to begin with.)

So far we have ignored the drift. However, notice that the number of time steps to get to a representation which encodes the correct value of $\epsilon$ is, with high probability, polynomial (in fact just $2/t_u(n,1/\epsilon)$). Thus by making the drift small enough (though still an inverse polynomial), the function can be made to look essentially unchanging to the evolution algorithm.

A.8 Proof of Theorem 15

We show that Neigh is a strictly beneficial neighborhood function for $C$, $\mathcal{D}$, and $\mathcal{R}$ with $b(n,1/\epsilon) = \pi^3 n/(2\epsilon)$. The theorem is then an immediate consequence of Theorem 8.

The analysis relies heavily on a couple of useful trigonometric facts. First, it is well known (see, for example, Dasgupta) that under any spherically symmetric distribution $\mathcal{D}$ (for example, the uniform distribution over a sphere), $\mathrm{err}_{\mathcal{D}}(u,v) = \arccos(u\cdot v)/\pi$, where $\arccos(u\cdot v)$ is the angle between $u$ and $v$. We will use this fact repeatedly. We also make use of the following inequalities from Dasgupta et al. For any $\theta \in [0,\pi/2]$, $2\theta/\pi \le \sin(\theta) \le \theta$, and $4\theta^2/\pi^2 \le 1 - \cos(\theta) \le \theta^2/2$.
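Both facts are easy to check numerically. The sketch below is only an illustration: it uses a standard Gaussian as one example of a spherically symmetric distribution, estimates the disagreement probability of two homogeneous linear separators by sampling, and compares it with $\arccos(u\cdot v)/\pi$. All function names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_unit_vector(n, rng):
    """Uniformly random direction, obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def angular_error(u, v):
    """err(u, v) = arccos(u . v) / pi for unit vectors u and v."""
    return math.acos(max(-1.0, min(1.0, dot(u, v)))) / math.pi


def empirical_error(u, v, samples, rng):
    """Monte Carlo estimate of Pr_x[sign(u . x) != sign(v . x)], x Gaussian."""
    n = len(u)
    disagreements = 0
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if (dot(u, x) >= 0) != (dot(v, x) >= 0):
            disagreements += 1
    return disagreements / samples


def trig_inequalities_hold(theta, tol=1e-12):
    """Check 2t/pi <= sin t <= t and 4t^2/pi^2 <= 1 - cos t <= t^2/2."""
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return (2 * theta / math.pi <= s + tol and s <= theta + tol
            and 4 * theta ** 2 / math.pi ** 2 <= c + tol
            and c <= theta ** 2 / 2 + tol)


if __name__ == "__main__":
    rng = random.Random(0)
    u, v = random_unit_vector(5, rng), random_unit_vector(5, rng)
    print(angular_error(u, v), empirical_error(u, v, 20000, rng))
```

The two printed numbers should agree up to sampling noise, which is on the order of $1/\sqrt{\text{samples}}$.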

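Consider an arbitrary $r \in \mathcal{R}_n$ and $f \in C_n$. To simplify presentation, assume that $r_1 = 1$ and $r_i = 0$ for $i \in \{2,\dots,n\}$. (Here and for the remainder of this proof, we use the notation $r_i$ and $f_i$ to denote the $i$th components of $r$ and $f$, not the values of the representation and ideal function at round $i$ as in previous sections.) This assumption is without loss of generality since we are considering only spherically symmetric distributions. Furthermore, assume that the axes are oriented such that for any $r' \in$ Neigh$(r,\epsilon)$ (except for $r$ itself), $r'_1 = \cos(\epsilon/(\pi\sqrt{n}))$, $r'_i = \pm\sin(\epsilon/(\pi\sqrt{n}))$ for some $i \in \{2,\dots,n\}$, and $r'_j = 0$ for all other $j$. This change in basis is also without loss of generality.

Suppose that $\mathrm{Perf}_f(r,\mathcal{D}) < 1-\epsilon/2$ since otherwise there is nothing to prove. The condition that we need to prove can be stated as

$$\max_{r'\in\text{Neigh}(r,\epsilon)} \mathrm{Perf}_f(r',\mathcal{D}) \ge \mathrm{Perf}_f(r,\mathcal{D}) + \frac{1}{b(n,1/\epsilon)} = \mathrm{Perf}_f(r,\mathcal{D}) + \frac{2\epsilon}{\pi^3 n}.$$

Using the facts that for any unit vectors $u$ and $v$, $\mathrm{Perf}_v(u,\mathcal{D}) = 1 - 2\,\mathrm{err}_{\mathcal{D}}(u,v)$ and $\mathrm{err}_{\mathcal{D}}(u,v) = \arccos(u\cdot v)/\pi$, this condition is equivalent to $\arccos(\max_{r'\in\text{Neigh}(r,\epsilon)} r'\cdot f) \le \arccos(r\cdot f) - \epsilon/(\pi^2 n)$. By definition of the neighborhood function, there exists an $r' \in$ Neigh$(r,\epsilon)$ such that

$$r'\cdot f \ge f_1\cos\left(\frac{\epsilon}{\pi\sqrt{n}}\right) + \max_{i\in\{2,\dots,n\}}|f_i|\,\sin\left(\frac{\epsilon}{\pi\sqrt{n}}\right) \ge f_1\cos\left(\frac{\epsilon}{\pi\sqrt{n}}\right) + \sqrt{\frac{1-f_1^2}{n}}\,\sin\left(\frac{\epsilon}{\pi\sqrt{n}}\right).$$

Using the standard trigonometric equality that for any $\theta$ and $\phi$, $\arccos(\theta) - \arccos(\phi) = \arccos\left(\theta\phi + \sqrt{(1-\theta^2)(1-\phi^2)}\right)$, we have

$$\arccos(r\cdot f) - \frac{\epsilon}{\pi^2 n} = \arccos(f_1) - \arccos\left(\cos\left(\frac{\epsilon}{\pi^2 n}\right)\right) = \arccos\left(f_1\cos\left(\frac{\epsilon}{\pi^2 n}\right) + \sqrt{1-f_1^2}\,\sin\left(\frac{\epsilon}{\pi^2 n}\right)\right).$$

Then since arccos is decreasing in $[0,\pi]$, to prove the result, it is sufficient to show that

$$\arccos\left(f_1\cos\left(\frac{\epsilon}{\pi\sqrt{n}}\right) + \sqrt{\frac{1-f_1^2}{n}}\,\sin\left(\frac{\epsilon}{\pi\sqrt{n}}\right)\right) \le \arccos\left(f_1\cos\left(\frac{\epsilon}{\pi^2 n}\right) + \sqrt{1-f_1^2}\,\sin\left(\frac{\epsilon}{\pi^2 n}\right)\right),$$

or taking the cosine of both sides and rearranging terms,

$$f_1\left(\cos\left(\frac{\epsilon}{\pi^2 n}\right) - \cos\left(\frac{\epsilon}{\pi\sqrt{n}}\right)\right) \le \sqrt{1-f_1^2}\left(\frac{1}{\sqrt{n}}\sin\left(\frac{\epsilon}{\pi\sqrt{n}}\right) - \sin\left(\frac{\epsilon}{\pi^2 n}\right)\right). \qquad (7)$$

First consider the case in which $f_1 < 0$. In this case, it is sufficient to show that the difference of cosines on the left hand side of the equation and the difference in sines on the right hand side are both positive. This can be verified easily using the inequalities for sines and cosines given above.

For the rest of this proof, assume that $f_1 \ge 0$. Since we have assumed that $\mathrm{Perf}_f(r,\mathcal{D}) < 1-\epsilon/2$, it follows that $\mathrm{err}(r,f) = \arccos(f_1)/\pi > \epsilon/4$, or equivalently, $f_1 < \cos(\epsilon\pi/4)$. Then $\sqrt{1-f_1^2} > \sqrt{1-\cos^2(\epsilon\pi/4)} = \sin(\epsilon\pi/4)$, and for Equation 7 to hold, it is sufficient to show that

$$\cos\left(\frac{\epsilon\pi}{4}\right)\left(\cos\left(\frac{\epsilon}{\pi^2 n}\right) - \cos\left(\frac{\epsilon}{\pi\sqrt{n}}\right)\right) \le \sin\left(\frac{\epsilon\pi}{4}\right)\left(\frac{1}{\sqrt{n}}\sin\left(\frac{\epsilon}{\pi\sqrt{n}}\right) - \sin\left(\frac{\epsilon}{\pi^2 n}\right)\right). \qquad (8)$$

Using the inequalities for sines and cosines given above, we have that

$$\cos\left(\frac{\epsilon\pi}{4}\right)\left(\cos\left(\frac{\epsilon}{\pi^2 n}\right) - \cos\left(\frac{\epsilon}{\pi\sqrt{n}}\right)\right) \le 1\cdot\left(1 - \cos\left(\frac{\epsilon}{\pi\sqrt{n}}\right)\right) \le \frac{\epsilon^2}{2\pi^2 n}$$

and

$$\sin\left(\frac{\epsilon\pi}{4}\right)\left(\frac{1}{\sqrt{n}}\sin\left(\frac{\epsilon}{\pi\sqrt{n}}\right) - \sin\left(\frac{\epsilon}{\pi^2 n}\right)\right) \ge \frac{\epsilon}{2}\left(\frac{2\epsilon}{\pi^2 n} - \frac{\epsilon}{\pi^2 n}\right) = \frac{\epsilon^2}{2\pi^2 n}.$$

Therefore Equation 8 holds, Neigh is a strictly beneficial neighborhood function, and $C$ is evolvable with drifting targets. ■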

A.9 Proof of Theorem 16

We start by analyzing the simpler case in which $\mathcal{D}$ is known to be a spherical Gaussian distribution. In this case, a simpler neighborhood function can be used in which the set of "shift" neighbors $N_{sl}$ is greatly reduced. Below, we explain how to extend this analysis to the case in which $\mathcal{D}$ is an unknown product normal distribution over $\mathbb{R}^n$ and the more complex neighborhood function defined in Section 6.2 is used.

Throughout this proof, we use the notation $r_i$ and $f_i$ to denote the $i$th components of $r$ and $f$ respectively, and denote by $\{e_i\}$ the basis of $\mathbb{R}^n$. We define the simplified neighborhood function as Neigh'$(r,\epsilon) = N_{fl} \cup N'_{sl}$, where $N_{fl} = \{r - 2r_ie_i \mid i = 1,\dots,n\}$ is still the set of representations obtained by flipping the sign of one component of $r$, and $N'_{sl} = \{r'_i/\|r'_i\|_2 \mid r'_i = r \pm \beta e_i,\ i = 1,\dots,n\}$ is the set obtained by shifting each component a small amount and renormalizing, with $\beta$ satisfying

$$\frac{\epsilon^2}{6\sqrt{n}} \le \beta \le \frac{\epsilon^2}{3\sqrt{n}}.$$

We first show that for any target function $f$, increasing $\mathrm{Perf}_f(r,\mathcal{D})$ is equivalent to increasing $f\cdot r$. The following two lemmas establish this. These lemmas rely on the same trigonometric facts used in the proof of Theorem 15. In particular, under any spherically symmetric distribution $\mathcal{D}$, $\mathrm{err}_{\mathcal{D}}(u,v) = \arccos(u\cdot v)/\pi$, where $\arccos(u\cdot v)$ is the angle between $u$ and $v$, and for any $\theta \in [0,\pi/2]$, $2\theta/\pi \le \sin(\theta) \le \theta$, and $4\theta^2/\pi^2 \le 1-\cos(\theta) \le \theta^2/2$.

Lemma 20 Let $\mathcal{D}$ be a spherical Gaussian distribution. For unit vectors $v$, $f$ and $\alpha \in (0,1)$, if $f\cdot v \ge 1-\alpha$, then $\mathrm{Perf}_f(v,\mathcal{D}) \ge 1-\sqrt{\alpha}$.

Proof: Since $f$ and $v$ are unit vectors and $\alpha \in (0,1)$, i.e., $f\cdot v > 0$, we may write $f\cdot v = \cos(\theta)$ for $\theta \in [0,\pi/2]$. Thus we have that $1 - 4\theta^2/\pi^2 \ge \cos(\theta) \ge 1-\alpha$, and hence $\theta/\pi \le \sqrt{\alpha}/2$. But $\theta/\pi = \mathrm{err}(f,v)$ and $\mathrm{Perf}_f(v,\mathcal{D}) = 1 - 2\,\mathrm{err}(f,v) \ge 1-\sqrt{\alpha}$. ■

Lemma 21 Let $\mathcal{D}$ be a spherical Gaussian distribution. For unit vectors $u$, $v$, $f$, if $f\cdot u - f\cdot v \ge \omega > 0$, then $\mathrm{Perf}_f(u,\mathcal{D}) - \mathrm{Perf}_f(v,\mathcal{D}) \ge \omega/2$.

Proof: Since $u$, $v$ and $f$ are unit vectors, we may assume that there exist angles $\phi,\theta \in [0,\pi]$ such that $f\cdot u = \cos(\phi)$ and $f\cdot v = \cos(\theta)$, and that $\phi < \theta$. Since the derivative of the cosine function is lower bounded by $-1$, $\cos(\phi) - \cos(\theta) \le \theta - \phi$. Finally observe that $\mathrm{Perf}_f(u,\mathcal{D}) - \mathrm{Perf}_f(v,\mathcal{D}) = 2(\mathrm{err}(f,v) - \mathrm{err}(f,u)) = \frac{2}{\pi}(\theta-\phi) \ge \omega/2$. ■
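Both lemmas reduce to one-dimensional statements about the map from an inner product $c = f\cdot v$ to the performance $1 - 2\arccos(c)/\pi$, so they can be checked on a grid. This is a small sketch with hypothetical function names; it checks the inequalities numerically and is not a substitute for the proofs.

```python
import math


def perf_from_dot(c: float) -> float:
    """Performance 1 - 2*arccos(c)/pi of a unit vector whose inner product
    with the unit target vector is c (spherically symmetric distribution)."""
    return 1.0 - 2.0 * math.acos(max(-1.0, min(1.0, c))) / math.pi


def lemma20_holds(alpha: float, tol: float = 1e-12) -> bool:
    """Performance is increasing in the inner product, so f.v = 1 - alpha
    is the worst case allowed by the hypothesis f.v >= 1 - alpha."""
    return perf_from_dot(1.0 - alpha) >= 1.0 - math.sqrt(alpha) - tol


def lemma21_holds(dot_u: float, dot_v: float, tol: float = 1e-12) -> bool:
    """If f.u - f.v = omega > 0, the performance gap is at least omega / 2."""
    omega = dot_u - dot_v
    return perf_from_dot(dot_u) - perf_from_dot(dot_v) >= omega / 2.0 - tol


if __name__ == "__main__":
    print(all(lemma20_holds(k / 100) for k in range(1, 100)))
    print(all(lemma21_holds(a / 20, b / 20)
              for a in range(-20, 21) for b in range(-20, a)))
```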

The next lemma shows that Neigh' is a strictly beneficial neighborhood function. More than the lemma statement itself, it is the analysis of this lemma that is important to us. Below we will show how to extend this analysis to the case in which $\mathcal{D}$ is a product normal distribution.

Lemma 22 Let $C$ be the class of homogeneous linear separators, $\mathcal{R}$ be the class of homogeneous linear separators represented by unit length normal vectors, and $\mathcal{D}$ be a spherical Gaussian distribution. Define Neigh' as in the previous paragraph and let $p$ be any polynomial such that $p(n,1/\epsilon) \ge 3n$. Then Neigh' is a strictly beneficial neighborhood function for $C$, $\mathcal{D}$, and $\mathcal{R}$, with $b(n,1/\epsilon) = 144n/\epsilon^4$.

Proof: Let $\rho = \epsilon^2/(12\sqrt{n})$ and $\eta = \epsilon^2/(3\sqrt{n})$. By assumption, $\beta$ then satisfies $\eta/2 \le \beta \le \eta$.

Consider an arbitrary $r \in \mathcal{R}_n$ and $f \in C_n$. If there exists $r' \in N_{fl}$ such that $\mathrm{Perf}_f(r',\mathcal{D}) - \mathrm{Perf}_f(r,\mathcal{D}) \ge 1/b(n,1/\epsilon)$, then we are done, so assume that there is no element in $N_{fl}$ with this property. In this case, we then claim that one of the following must hold for all $i = 1,\dots,n$: (i) $r_i$ and $f_i$ have the same sign, (ii) $|r_i| \le \rho$, or (iii) $|f_i| \le \rho$. If none of these properties hold for some $i$, then $f\cdot(r - 2r_ie_i) - f\cdot r = -2r_if_i > 2\rho^2$ and by Lemma 21 the change in performance is at least $\rho^2 = 1/b(n,1/\epsilon)$.



In the rest of the analysis we assume that if no flip is a beneficial mutation (by at least $1/b(n,1/\epsilon)$), all $r_i$ and $f_i$ are in the interval $[-\rho,1]$. The reason this does not affect generality is this: suppose one of them is smaller than $-\rho$; we know that the other one then lies in the interval $[-1,\rho]$. We can now assume that the basis we were working with actually contained $-e_i$ rather than $e_i$. (This is useful for analysis, so that we can only consider mutations which increase the value of any component.) However, the neighborhood contains both mutations. Thus, we may assume that all $f_i$ and $r_i$ are in $[-\rho,1]$, and hence $f\cdot r \ge -\rho(\|f\|_1 + \|r\|_1) \ge -2\rho\sqrt{n}$.

In this situation if there is no $i$ for which $r_i < f_i - \eta$, then $f\cdot r \ge 1-\epsilon^2$, and we are already close to optimal (as shown below):

$$\begin{aligned}
f\cdot r = \sum_i f_ir_i &= \sum_{f_i\in[-\rho,0)} f_ir_i + \sum_{f_i\in[0,1]} f_ir_i \ge -\rho\|r\|_1 + \sum_{f_i\in[0,1]} f_i(f_i-\eta) \\
&\ge \sum_{f_i\in[0,1]} f_i^2 - \rho\|r\|_1 - \eta\|f\|_1 \ge 1 - n\rho^2 - \sqrt{n}\rho - \sqrt{n}\eta.
\end{aligned}$$

Suppose there exists an $i$ for which $r_i < f_i - \eta$; then $f_i > \eta - \rho \ge \eta/2 > 0$. Let $r' = r + \beta e_i$, with $\eta/2 \le \beta \le \eta$. Then $\|r'\|_2 = \sqrt{1 + 2\beta r_i + \beta^2}$. From elementary algebra, we get the inequality $1 + \beta r_i \le \sqrt{1 + 2\beta r_i + \beta^2} \le 1 + \beta r_i + \beta^2/2$ (assuming $\beta r_i \in (-1,1)$, which is true). Then consider the following quantity of interest:

$$f\cdot\frac{r'}{\|r'\|_2} - f\cdot r = \frac{f\cdot r' - \sqrt{1 + 2\beta r_i + \beta^2}\,(f\cdot r)}{\|r'\|_2}.$$

Since $1/2 \le \|r'\|_2 \le 2$, if the quantity in the numerator is positive (as we will show) we have

$$\begin{aligned}
2\left(f\cdot\frac{r'}{\|r'\|_2} - f\cdot r\right) &\ge f\cdot(r + \beta e_i) - \sqrt{1 + 2\beta r_i + \beta^2}\,(f\cdot r) \\
&= \beta f_i + f\cdot r\left(1 - \sqrt{1 + 2\beta r_i + \beta^2}\right). \qquad (9)
\end{aligned}$$

Notice that by our setting of parameters, $r_i \ge -\rho \ge -\beta/2$, thus the quantity under the square root sign is at least 1. When $f\cdot r < 0$, the second term in the above expression is actually positive, and hence the total quantity is at least as much as the first term, which is at least $\beta\eta/2 \ge \eta^2/4 \ge 2/b(n,1/\epsilon)$. Thus we will consider the case when $f\cdot r \ge 0$. In that case, continuing from Equation 9 and using the fact that $\sqrt{1 + 2\beta r_i + \beta^2} \le 1 + \beta r_i + \beta^2/2$, we get

$$f\cdot\frac{r'}{\|r'\|_2} - f\cdot r \ge \frac{1}{2}\left(\beta f_i - (1-\epsilon^2)\left(\beta r_i + \beta^2/2\right)\right).$$

Since $r_i + \beta/2 < r_i + \eta < f_i$, this is greater than $\beta f_i\epsilon^2/2 \ge 2/b(n,1/\epsilon)$. Hence Neigh' is a strictly beneficial neighborhood function. ■
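To make the simplified neighborhood concrete, here is a minimal sketch that builds the flip and shift neighbors of a unit vector and measures the best available improvement under a spherically symmetric distribution. It assumes performance is computed as $1 - 2\arccos(f\cdot r)/\pi$, takes $\beta$ as a caller-supplied value in the allowed range, and reads the benefit as $1/b = \rho^2 = \epsilon^4/(144n)$ with $\rho = \epsilon^2/(12\sqrt{n})$. The function names are hypothetical.

```python
import math


def simplified_neighborhood(r, beta):
    """Flip neighbors r - 2*r_i*e_i, then shift neighbors
    (r +/- beta*e_i) / ||r +/- beta*e_i||_2, for a unit vector r."""
    n = len(r)
    neighbors = []
    for i in range(n):                       # n flips
        flipped = list(r)
        flipped[i] = -flipped[i]
        neighbors.append(flipped)
    for i in range(n):                       # 2n shifts, renormalized
        for sign in (1.0, -1.0):
            shifted = list(r)
            shifted[i] += sign * beta
            norm = math.sqrt(sum(x * x for x in shifted))
            neighbors.append([x / norm for x in shifted])
    return neighbors


def performance(f, r):
    """Perf_f(r) = 1 - 2*arccos(f . r)/pi for unit vectors f and r."""
    c = sum(a * b for a, b in zip(f, r))
    return 1.0 - 2.0 * math.acos(max(-1.0, min(1.0, c))) / math.pi


def best_improvement(f, r, beta):
    """Largest performance gain over all neighbors in the neighborhood."""
    base = performance(f, r)
    return max(performance(f, rp) - base
               for rp in simplified_neighborhood(r, beta))


if __name__ == "__main__":
    eps, n = 0.5, 2
    beta = eps ** 2 / (4 * math.sqrt(n))     # lies between eps^2/(6 sqrt n) and eps^2/(3 sqrt n)
    r, f = [1.0, 0.0], [0.0, 1.0]            # orthogonal: performance 0
    print(best_improvement(f, r, beta), eps ** 4 / (144 * n))
```

The neighborhood has $3n$ elements, which is consistent with the requirement $p(n,1/\epsilon) \ge 3n$ in the lemma statement.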

Product Normal Distributions

We now describe how the analysis above can be adjusted to handle product normal distributions with polynomial variances. Recall that $\sigma_1,\dots,\sigma_n$ are the standard deviations of the distribution $\mathcal{D}$ in each of the $n$ dimensions, and that $1 \ge \sigma_i \ge (1/n)^k$ for all $i$ for some constant $k$ (which is known by the algorithm). Assume without loss of generality that $1 = \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge (1/n)^k$.

Define $\tau(x_1,\dots,x_n) = (x_1/\sigma_1,\dots,x_n/\sigma_n)$ and $\Lambda(x_1,\dots,x_n) = (\sigma_1x_1,\dots,\sigma_nx_n)$. Note that the transformations $\tau$ and $\Lambda$ are inverses. Let $\mathcal{N}[0,\Sigma]$ denote the distribution with covariance matrix $\Sigma = \mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2)$, and let $\mathcal{N}[0,I]$ denote the spherical normal distribution with variance 1. Note that if $x$ is distributed according to $\mathcal{N}[0,\Sigma]$, $\tau(x)$ is distributed according to $\mathcal{N}[0,I]$. For any vectors $f$ and $r$, we have

$$\begin{aligned}
\mathrm{err}_{\mathcal{N}[0,\Sigma]}(f,r) &= \Pr_{x\sim\mathcal{N}[0,\Sigma]}\left[\mathrm{sign}(\Lambda(f)\cdot\tau(x)) \ne \mathrm{sign}(\Lambda(r)\cdot\tau(x))\right] \\
&= \Pr_{z\sim\mathcal{N}[0,I]}\left[\mathrm{sign}(\Lambda(f)\cdot z) \ne \mathrm{sign}(\Lambda(r)\cdot z)\right] = \mathrm{err}_{\mathcal{N}[0,I]}(\Lambda(f),\Lambda(r)).
\end{aligned}$$

We assume that $\|f\|_2 = 1$ and that our representations consist of vectors also of unit norm. Then, for all $r$ we have $(1/n)^k \le \|\Lambda(r)\|_2 \le 1$. Observe that the "flips" are invariant with respect to $\Lambda$, i.e., $\Lambda(r - 2r_ie_i) = \Lambda(r) - 2\Lambda(r)_ie_i$. Because of this, we can consider the same set $N_{fl}$ of flips as in the spherical distribution case.



Unfortunately, it does not suffice to use the same set of "shift" mutations $N'_{sl}$. Let $r$ be our current representation and let $r' = r + \gamma e_i$. Consider the two vectors $\Lambda(r)/\|\Lambda(r)\|_2$ and $\Lambda(r')/\|\Lambda(r)\|_2$, and consider their $i$th components, which are $\sigma_ir_i/\|\Lambda(r)\|_2$ and $\sigma_i(r_i+\gamma)/\|\Lambda(r)\|_2$ respectively. The difference between the two is $\sigma_i\gamma/\|\Lambda(r)\|_2$. As in the proof of Lemma 22, let $\eta = \epsilon^2/(3\sqrt{n})$. If $\gamma$ took a certain value such that $\eta/2 \le \sigma_i\gamma/\|\Lambda(r)\|_2 \le \eta$, then by the same analysis as in the proof of Lemma 22 this would be a beneficial mutation.

Since we don't know the values of $\sigma_i$ and $\|\Lambda(r)\|_2$, we use the following trick: Let

$$N_i = \left\{ r \pm \frac{j\eta}{4n^k}\,e_i \;\middle|\; 1 \le j \le 4n^{2k} \right\}.$$

Now consider the quantity

$$\gamma_j = \frac{\sigma_i}{\|\Lambda(r)\|_2}\cdot\frac{j\eta}{4n^k}.$$

Observe that $(1/n)^k \le \sigma_i/\|\Lambda(r)\|_2 \le n^k$. Thus $\gamma_1 \le \eta/4$, $\gamma_{4n^{2k}} \ge \eta$, and finally $\gamma_j - \gamma_{j-1} \le \eta/4$; at least one $j$ satisfies $\eta/2 \le \gamma_j \le \eta$. Let $N_{sl} = \{r'/\|r'\|_2 \mid r' \in N_i,\ i \in \{1,\dots,n\}\}$. (This is the same set $N_{sl}$ defined in Section 6.2, only in slightly different notation.) With Neigh$(r,\epsilon) = N_{fl} \cup N_{sl}$, Neigh is then a strictly beneficial neighborhood function with respect to any product normal distribution with variance lower bounded by $(1/n)^k$ as desired. The benefit polynomial remains the same as in Lemma 22, $b(n,1/\epsilon) = 144n/\epsilon^4$, though the neighborhood size is larger. ■
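The covering argument behind the grid of shift sizes can be checked directly. The sketch below assumes the unknown ratio $\sigma_i/\|\Lambda(r)\|_2$ is given as a number `scale` in $[n^{-k}, n^k]$ and searches the grid $j\eta/(4n^k)$, $1 \le j \le 4n^{2k}$, for an index whose effective shift $\gamma_j$ lands in $[\eta/2, \eta]$. The names `shift_grid` and `find_good_shift` are hypothetical.

```python
def shift_grid(eta: float, n: int, k: int) -> list:
    """Shift magnitudes j*eta/(4 n^k) for j = 1, ..., 4 n^(2k)."""
    return [j * eta / (4 * n ** k) for j in range(1, 4 * n ** (2 * k) + 1)]


def find_good_shift(scale: float, eta: float, n: int, k: int):
    """Return (j, gamma_j) for the first grid index whose effective shift
    gamma_j = scale * j*eta/(4 n^k) lies in [eta/2, eta], or None."""
    for j, shift in enumerate(shift_grid(eta, n, k), start=1):
        gamma = scale * shift
        if eta / 2 <= gamma <= eta:
            return j, gamma
    return None


if __name__ == "__main__":
    n, k, eta = 3, 2, 0.01
    for scale in (n ** -k, 0.5, 1.0, 7.0, n ** k):
        print(scale, find_good_shift(scale, eta, n, k))
```

Because consecutive values of $\gamma_j$ differ by at most $\eta/4$ while the target window has width $\eta/2$, the search cannot step over the window, which is the point of the construction.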

A. 10 Proof of Theorem [H] 

We show that Neigh is a strictly beneficial neighborhood function for C, V, and TZ with benefit 
polynomial b{n, 1/e) = 9/e^. The theorem is then an immediate consequence of Theorem[8l 

Since q = [log2(3/e)], it follows that e/6 < 2~'^ < e/3. We make use of this repeatedly. 

Consider an arbitrary r S TZn and f £ Cn- As in Diochnos and Turan Q, we define m to be 
the number of "mutual" variables shared between / and r, u to be the number of "undiscovered" 
variables that appear in / but not r, and w to be the number of "wrong" variables that appear in 
r but not /. Thus |/| = m + u and |r| = 777 + w. The functions r and / disagree if and only if all m 
mutual variables are true and either all u undiscovered variables are true while some wrong variable 
is false, or all w wrong variables are true while some undiscovered variable is false. Therefore if T> 
is uniform, erri,(/,r) = 2-" (2-"(l - 2-"") + 2-^(1 - 2"")) = 2""-" + 2-"-"' - 2i-"'^"-"', so 

Perf /(r, V) = 1 - 2^-"'-" - 2^-"'-"' + 22-'"-^-"' = 1 _ 2^-^-''^ - 2^^^''^ + 22-'"-"-"' . (10) 

We start by considering the case in which the target is "long", that is, \f\=m + u>q+l. If 
\r\ =m + w = q, then Perf /(r, P) > 1-2^'^^^ -2^'^''^ > l-2-9-2i-« = 1-3-2-9 > 1-e, andthe 
performance of r with respect to / is already good enough. This is because both / and r are almost 
always false under the uniform distribution. On the other hand, if \r\ < q, then there must exist 
a neighbor r' in the set Af~^{r) such that the variable contained in r' but not r is an undiscovered 
variable of /. Then Perf/(r',2?) - Perf/(r,2?) = 2-l'^l > 2-'? = e/6. 

Now consider the case in which the target is "short", that is, $|f| = m + u \leq q$. Suppose that 
$u = 0$ (so the variables in $f$ are a subset of the variables in $r$). If $w = 0$, then $f$ and $r$ must be 
identical, so assume $w > 0$. Then there must exist a neighbor $r'$ in the set $\mathcal{N}^{-}(r)$ such that the variable 
contained in $r$ but not $r'$ is a wrong variable. From Equation (10), for this neighbor $r'$, $\mathrm{Perf}_f(r', \mathcal{D}) - 
\mathrm{Perf}_f(r, \mathcal{D}) = 2^{1-|r|} \geq 2^{1-q} \geq \epsilon/3$. On the other hand, suppose that $u > 0$. If $|r| = m + w < q$, then 
there must exist a neighbor $r'$ in the set $\mathcal{N}^{+}(r)$ such that the variable contained in $r'$ but not $r$ is 
an undiscovered variable of $f$. As above, $\mathrm{Perf}_f(r', \mathcal{D}) - \mathrm{Perf}_f(r, \mathcal{D}) = 2^{-|r|} > 2^{-q} \geq \epsilon/6$. Finally, if 
$|r| = m + w = q$, then there must exist a neighbor $r'$ in the set $\mathcal{N}^{+-}(r)$ such that the variable contained 
in $r$ but not $r'$ is wrong and the variable contained in $r'$ but not $r$ is an undiscovered variable of $f$. 
In this case, from Equation (10), $\mathrm{Perf}_f(r', \mathcal{D}) - \mathrm{Perf}_f(r, \mathcal{D}) = 2^{2-m-u-w} \geq 2^{2-2q} \geq \epsilon^2/9$. 

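We have shown that whenever $\mathrm{Perf}_f(r, \mathcal{D}) < 1 - \epsilon$, there exists an $r' \in \mathrm{Neigh}(r, \epsilon)$ such that 
$\mathrm{Perf}_f(r', \mathcal{D}) - \mathrm{Perf}_f(r, \mathcal{D}) \geq \epsilon^2/9$, so Neigh is a strictly beneficial neighborhood function. ■ 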
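The case analysis above can be checked exhaustively for small n. The following is a minimal sketch, not the authors' construction: it assumes that the neighborhood of r consists of all conjunctions with at most q variables obtained by adding one variable, removing one variable, or swapping one variable for another, with q = ⌈log2(3/ε)⌉, which is what the proof uses. The full definition of Neigh lies outside this section, and all function names here are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def perf(f, r):
    """Equation (10), with f and r given as sets of variable indices."""
    m, u, w = len(f & r), len(f - r), len(r - f)
    two = Fraction(2)
    return 1 - two ** (1 - m - u) - two ** (1 - m - w) + two ** (2 - m - u - w)

def neighbors(r, n, q):
    """Assumed neighborhood: remove, add (if |r| < q), or swap one variable."""
    outside = [v for v in range(n) if v not in r]
    out = [r - {v} for v in r]                             # removals
    if len(r) < q:
        out += [r | {a} for a in outside]                  # additions
    out += [(r - {v}) | {a} for v in r for a in outside]   # swaps
    return out

def strictly_beneficial(n, eps):
    """Check the eps^2/9 improvement for every target f and hypothesis r."""
    q = 0
    while 2 ** q < 3 / eps:   # q = ceil(log2(3/eps)), computed without floats
        q += 1
    subsets = [frozenset(c) for k in range(n + 1)
               for c in combinations(range(n), k)]
    for f in subsets:
        for r in (s for s in subsets if len(s) <= q):
            p = perf(f, r)
            if p < 1 - eps:
                best = max(perf(f, rp) for rp in neighbors(r, n, q))
                if best - p < eps ** 2 / 9:
                    return False
    return True
```

Passing `eps` as a `Fraction` keeps every comparison exact, so a returned `True` is not an artifact of floating-point rounding.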

A.11 Proof of Theorem [H 

We show that Neigh is a strictly beneficial neighborhood function for $\mathcal{C}$, $\mathcal{D}$, and $\mathcal{R}$ with benefit 
polynomial $b(n, 1/\epsilon) = 9/\epsilon^2$. The theorem is then an immediate consequence of Theorem 5. 

Consider an arbitrary $r \in \mathcal{R}_n$ and $f \in \mathcal{C}_n$. As in the proof of Theorem 17, we start by considering 
the case in which the target is "long", that is, $|f| \geq q + 1$, where $q = \lceil \log_2(3/\epsilon) \rceil$. If $|r| = q$, then as 
before, $\mathrm{Perf}_f(r, \mathcal{D}) \geq 1 - 2^{1-|f|} - 2^{1-|r|} \geq 1 - \epsilon$, and the performance of $r$ with respect to $f$ is already 
good enough. Now suppose $|r| < q$. If $r$ does not contain a literal that is a negation of a literal in $f$, then Equation (10) 
holds and just as in the proof of Theorem 17 there must exist a neighbor $r'$ in the set $\mathcal{N}^{+}(r)$ such 
that $\mathrm{Perf}_f(r', \mathcal{D}) - \mathrm{Perf}_f(r, \mathcal{D}) = 2^{-|r|} > 2^{-q} \geq \epsilon/6$. On the other hand, if $r$ does contain at least 
one literal that is a negation of a literal in $f$, then $f$ and $r$ are never simultaneously true and so 
$\mathrm{Perf}_f(r, \mathcal{D}) = 1 - 2(2^{-|f|} + 2^{-|r|}) = 1 - 2^{1-|f|} - 2^{1-|r|}$. In this case, by a similar argument, a neighbor 
$r' \in \mathcal{N}^{+}(r)$ has performance $1 - 2^{1-|f|} - 2^{-|r|}$, so $\mathrm{Perf}_f(r', \mathcal{D}) - \mathrm{Perf}_f(r, \mathcal{D}) = 2^{-|r|} > 2^{-q} \geq \epsilon/6$. 

Now consider the case in which $f$ is "short". If $r$ does not contain a literal that is a negation of 
a literal in $f$, then Equation (10) holds and the case-by-case analysis is identical to the analysis in the 
proof of Theorem 17. Suppose $r$ contains at least one literal that is the negation of a literal in $f$. In 
this case, as above, $\mathrm{Perf}_f(r, \mathcal{D}) = 1 - 2^{1-|f|} - 2^{1-|r|}$. Let $r' \in \mathcal{N}'(r)$ be the conjunction obtained 
by starting with $r$ and negating all literals in $S$, where $S$ is the set of literals in $r$ whose negations 
appear in $f$. Since $r'$ contains no negation of a literal in $f$, from Equation (10) we have that $\mathrm{Perf}_f(r', \mathcal{D}) \geq 
1 - 2^{1-|f|} - 2^{1-|r|} + 2^{2-|f|-|r|}$, and so $\mathrm{Perf}_f(r', \mathcal{D}) - \mathrm{Perf}_f(r, \mathcal{D}) \geq 2^{2-|f|-|r|} \geq 2^{2-2q} \geq \epsilon^2/9$. The 
lemma statement follows. ■ 
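The conflicting-literal case can be illustrated on a small instance. The sketch below is a hedged example rather than part of the proof: it represents a conjunction as a hypothetical mapping from variable index to required bit, assumes values in {-1, +1} with performance E[f(x) r(x)] under the uniform distribution, and takes the negation step to flip exactly those literals of r whose negations appear in f.

```python
from fractions import Fraction
from itertools import product

def evaluate(lits, x):
    """Conjunction of literals {var: required_bit}; returns +1 or -1."""
    return 1 if all(x[v] == b for v, b in lits.items()) else -1

def perf(f, r, n):
    """E[f(x) r(x)] under the uniform distribution on {0,1}^n."""
    total = sum(evaluate(f, x) * evaluate(r, x)
                for x in product((0, 1), repeat=n))
    return Fraction(total, 2 ** n)

def negate_conflicts(f, r):
    """Flip every literal of r whose negation appears in f."""
    return {v: (f[v] if v in f else b) for v, b in r.items()}

n = 4
f = {0: 1, 1: 1}   # x0 AND x1
r = {0: 0, 2: 1}   # (NOT x0) AND x2, which conflicts with f on x0
r_fixed = negate_conflicts(f, r)

two = Fraction(2)
conflict_formula = 1 - two ** (1 - len(f)) - two ** (1 - len(r))
gain = perf(f, r_fixed, n) - perf(f, r, n)
lower_bound = two ** (2 - len(f) - len(r))
```

Here `perf(f, r, n)` agrees with `conflict_formula`, and `gain` is at least `lower_bound`, matching the inequality used in the short-target case.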



